DOCUMENT RESUME

ED 227 133                                        TM 830 125

AUTHOR          Hedges, Larry V.
TITLE           Statistical Methodology in Meta-Analysis.
INSTITUTION     ERIC Clearinghouse on Tests, Measurement, and
                Evaluation, Princeton, N.J.
SPONS AGENCY    National Inst. of Education (ED), Washington, DC.
REPORT NO       ERIC-TM-83
PUB DATE        Dec 82
CONTRACT        400-78-000g
NOTE            79p.
AVAILABLE FROM  ERIC/TM, Educational Testing Service, Princeton, NJ
                08541 ($7.00).
PUB TYPE        Information Analyses - ERIC Information Analysis
                Products (071); Guides - Non-Classroom Use (055)

EDRS PRICE      MF01/PC04 Plus Postage.
DESCRIPTORS     Analysis of Variance; Correlation; Error of
                Measurement; *Mathematical Models; *Research
                Methodology; Research Problems; *Statistical
                Analysis; Statistical Studies
IDENTIFIERS     *Effect Size; Glass (G V); *Meta Analysis



ABSTRACT

Meta-analysis has become an important supplement to
traditional methods of research reviewing, although many problems
must be addressed by the reviewer who carries out a meta-analysis.
These problems include identifying and obtaining appropriate studies,
extracting estimates of effect size from the studies, coding or
classifying studies, analyzing the data, and reporting the results of
the data analysis. Earlier work by Glass, McGaw, and Smith describes
methods for dealing with these problems and has generated a great
interest in the development of systematic statistical theory for
meta-analysis. This monograph supplements the existing literature on
meta-analysis by providing a unified treatment of rigorous
statistical methods for meta-analysis. These methods provide a
mechanism for responding to criticisms of meta-analysis, such as that
meta-analysis may lead to oversimplified conclusions or be influenced
by design flaws in the original research studies. Contents include:
indices of effect size, statistical analysis of effect size data,
assumptions and the statistical model, estimations of effect size, an
analogue to the analysis of variance for effect sizes, the effects of
measurement error on effect size, statistical analysis when
correlations or proportions are the index of effect magnitude, and
statistical analysis for correlations as effect magnitude.
(Author/PN)






U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not
necessarily represent official NIE position or policy.



STATISTICAL METHODOLOGY
IN
META-ANALYSIS

ERIC/TM REPORT 83

by
Larry V. Hedges

"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY

[signature illegible]

TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)"

ERIC CLEARINGHOUSE ON TESTS, MEASUREMENT, & EVALUATION
EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY 08541



ERIC/TM Report 83

STATISTICAL METHODOLOGY IN META-ANALYSIS

by

Larry V. Hedges

The University of Chicago

December 1982

ERIC Clearinghouse on Tests, Measurement, and Evaluation
Educational Testing Service, Princeton, NJ 08541



The material in this publication was prepared pursuant to a
contract with the National Institute of Education, U.S.
Department of Education. Contractors undertaking such projects
under government sponsorship are encouraged to express freely
their judgment in professional and technical matters. Prior to
publication, the manuscript was submitted to qualified
professionals for critical review and determination of
professional competence. This publication has met such
standards. Points of view or opinions, however, do not
necessarily represent the official view or opinions of either
these reviewers or the National Institute of Education.

ERIC Clearinghouse on Tests,
Measurement, and Evaluation
Educational Testing Service
Princeton, NJ 08541



Table of Contents

Introduction
Indices of Effect Size
The Effect Size as an Index of Effect Magnitude
Other Indices of Effect Magnitude
The Method of Meta-Analysis
The Statistical Analysis of Effect Size Data
Assumptions and the Statistical Model
Estimating Effect Size
Figure 1
An Unbiased Estimator of Effect Size
The Asymptotic Distribution of the Unbiased Estimator
Testing Homogeneity of Effect Size
Assessing Variability of Effect Sizes
Table 1
Figure 2
Estimation of Effect Size from a Series of Homogeneous Studies
Efficiency of the Weighted Estimator
An Analogue to the Analysis of Variance for Effect Sizes
Testing Homogeneity across Classes
Testing Homogeneity of Effect Sizes within Classes
An Analogy to the Analysis of Variance
Computational Formulas for H_T, H_B, and H_W
Fitting Effect Size Models to a Series of Studies
Comparisons between Classes
An Analogue to Multiple Regression for Effect Sizes
Testing Model Specification



i*- u ii'tit out i out l inu J 

Computing Estimates and Test Statistics
The Effects of Measurement Error on Effect Size
The Effects of Departures from Linear Equatability
Statistical Analysis When Correlations or Proportions
are the Index of Effect Magnitude
Statistical Analysis for Correlations as Effect Magnitudes
Statistical Analysis for Differences in Proportions
Conclusions
References



INTRODUCTION

The educational research enterprise has grown tremendously
in the last thirty years. The literature in many areas of
education and psychology has produced hundreds of studies on the
same topic. Yet few would argue that the knowledge base of the
social sciences has grown as rapidly as the volume of research
studies. Some critics and many reviewers contend that our state
of knowledge has remained unchanged despite the best efforts of
the social science research community. Until recently, research
reviews that yield equivocal conclusions have been the rule
rather than the exception. Glass (1976) noted that "the typical
reviewer concludes that the research is in horrible shape;
sometimes one gets results, sometimes one doesn't."

The recurrence of equivocal conclusions from research
reviews led some investigators to speculate that the process of
research review might be at fault. Light and Smith (1971) were
among the first investigators to examine the problem of
integrating the results of quantitative studies in the social
sciences. They demonstrated the importance of systematic
analysis of variations in design and execution of studies as
well as the variation in study outcomes.

Light and Smith also generalized an approach from cluster
sampling to generate an extensive algorithm and analysis
strategy for a series of similar experiments. Unfortunately,
their approach requires access to the original data, which limits
its practical usefulness in research integration.



Light and Smith asserted that, at that time, a technique
called vote-counting was the most commonly used method of
integrating research studies. In their formulation, a number of
studies compare the scores of two groups; one group of subjects
receives an experimental treatment and the other group receives
no treatment. In the vote-counting method the available studies
are sorted into three categories: those that yield positive
significant results, those that yield negative significant
results, and those that yield nonsignificant results.



If a plurality of studies falls into any of these
three categories, with fewer falling into the other
two, the modal category is declared the winner. This
modal categorization is then assumed to give the best
estimate of the true relationship between the independent
and dependent variables. (Light & Smith, 1971, p. 433).



Despite the obvious simplicity of vote-counting methods,
these techniques have very serious problems. The deficiency of
vote-counting methods stems from their reliance on tests of
statistical significance in individual research studies. Hedges
and Olkin (1980) proved that when studies typically use small
samples or when the phenomenon under study produces small
effects, vote-counting methods systematically fail to detect
effects. The reason for this behavior is related to the low
statistical power of significance tests when effects or sample
sizes are small. Small effects are the rule rather than the
exception in social science research.



For example, Gage (1978) has noted that the magnitude of
the relationship between any teaching variable and achievement
is likely to be small, although the cumulative effect of many
such variables need not be negligible. Similar arguments have
been made about the magnitude of relationships in social
psychology.

The consequence of small effects and sample sizes on the
power of statistical analyses in educational and psychological
research is illustrated in surveys of statistical power of
published research. Brewer (1972) calculated the power of
studies published in three educational research journals. His
analysis showed that the power of published studies to detect
small effects (a mean difference of 0.2 in standard deviation
units) was uniformly low. Only two per cent of the 55 studies
surveyed from the American Educational Research Journal had a
power greater than 0.3 to detect an effect that small. Thus the
probability of Type II errors (i.e., failure to reject the null
hypothesis when it is false) seems unacceptably high in these
studies. Similar results have been found in surveys of studies
in abnormal psychology (Cohen, 1962), communication research
(Katzer & Sodt, 1973), and applied psychology (Chase & Chase,
1976). If these surveys of social science research are
representative, failure to reject the null hypothesis in
individual research studies cannot provide much assurance that
small effects are not present.
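The low power described above can be illustrated with a rough calculation. The sketch below is not Brewer's procedure; it assumes equal group sizes, a two-sided test at the .05 level, and a normal approximation in place of the exact noncentral t calculation. The function name `approx_power` and the sample sizes are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def approx_power(delta, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of a mean difference.

    Assumption of this sketch: a normal approximation is used instead of
    the exact noncentral t distribution, and both groups have the same size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect size is delta.
    shift = delta * (n_per_group / 2) ** 0.5
    # Probability of falling in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Effect size 0.2 is the "small effect" used in the power surveys.
    for n in (10, 20, 50, 100, 400):
        print(f"n per group = {n:3d}: power = {approx_power(0.2, n):.3f}")
```

Under these assumptions, an effect of 0.2 standard deviations is detected with high probability only when each group contains several hundred subjects.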



A new approach to the problem of research integration was
proposed by Glass (1976). He argued that estimation of the
magnitude of the experimental effect is perhaps more important
than statistical significance. Glass suggested that the "effect
size" in a two-group experiment be defined as the difference
between the experimental and control group means divided by the
control group standard deviation. Glass coined the term
"meta-analysis" to describe the analysis of these "effect sizes"
from a series of studies.

Meta-analysis has become an important supplement to
traditional methods of research reviewing, largely as a result
of the work of Glass and his colleagues. They demonstrated that
the technique could be used to provide sensible answers to
fundamental questions in the behavioral sciences. The first
application of meta-analysis was the integration of studies on
the effects of psychotherapy (Smith & Glass, 1977). This first
meta-analysis intrigued many and stirred controversy for others.
A series of other analyses, including the meta-analyses of the
effects of class size (Glass & Smith, 1979; Smith & Glass, 1980),
have continued to provide strong evidence on long standing
controversies. The interest generated by these and other
examples, along with a lucid treatment of the methods of
meta-analysis (Glass, 1978), have encouraged other investigators
to use the technique.

Many problems must be addressed by the reviewer who
carries out a meta-analysis. These problems include identifying
and obtaining appropriate studies, extracting estimates of
effect size from the studies, coding or classifying studies,
analyzing the data, and reporting the results of the data
analysis. Some of these problems are similar to the problems
faced by the primary researcher (see Jackson, 1980, or Cooper,
1982). In other cases the problems are not the same as those
faced in primary research. The best source on methods for
conducting meta-analyses is the book by Glass, McGaw, and Smith
(1981). This book contains a reasonably complete coverage of
methods to deal with all of the problems mentioned above as well
as numerous others. This book draws upon the considerable
resources of three authors who have been instrumental in
developing and popularizing the method of meta-analysis.

Why then another paper on methodology for meta-analysis?
Since the publication of the Glass, McGaw, and Smith book, there
has been a great deal of interest in the development of
systematic statistical theory for meta-analysis. Many of the
techniques proposed and used by Glass and his associates were
sensible, but suboptimal. Recent work in the statistical theory
for meta-analysis has provided simple methods that can be
rigorously justified. The purpose of this monograph is to
supplement the existing literature on meta-analysis by providing
a unified treatment of rigorous statistical methods for
meta-analysis.

Indices of Effect Size 
Statistical methods have been used to combine information
from different research studies for many years. Some of the
earliest examples of this work are found in the work on
combining the results of agricultural experiments. Cochran
(1937) considered the problem of combining estimates of
treatment effects from a series of similar experiments. He
considered several methods of weighting estimates from each
experiment. Yates and Cochran (1938) developed more refined
weighting methods (e.g., partial weighting) for combining
estimates from several agricultural experiments. A more recent
review of statistical work in this tradition is given in Cochran
(1954). Tests of the statistical significance of combined
results were also introduced in connection with problems of
combining results of several studies in agriculture and biology
(e.g., Tippett, 1931; Pearson, 1933).

The early work on combining the results of studies in
agriculture involved combining the results of studies that share
a common, well-defined dependent variable. For example, the
object of a research synthesis in agriculture might be to
combine estimates of the barley crop yield derived from several
studies. Each study would measure the dependent variable in the
same way: the number of pounds of barley yielded per acre
planted. Therefore the means or treatment effect estimates
derived as mean differences are directly comparable and can be
directly combined by averaging. When a series of studies in the
social sciences use the same measure of the dependent variable,
methods developed for combining the results of agricultural
experiments can be used to combine estimates of the treatment
effect.



For many common educational variables, a variety of
different psychological tests provide reasonably adequate
measures of the underlying construct. Some authors (e.g.,
Campbell, 1969) have argued that the different measures
(operationalizations) of a construct are highly desirable.
Studies of different operationalizations of a construct allow
researchers to "triangulate" on the construct. Given a series
of tests that are supposed to measure the same construct, one
might ask what is meant by "measure the same construct." One
definition is that two tests measure the same construct if true
scores on the tests are perfectly correlated. This implies that
the tests are measures of the same construct if they are
linearly equatable except for errors of measurement. This
notion is the basis for deciding that two tests are measuring
the same thing if they yield intercorrelations that are about as
high as their reliabilities will allow.

The Effect Size as an Index of Effect Magnitude 

Glass (1976) suggested the standardized mean difference or
effect size as the scale invariant measure of treatment effect.
We define the effect size $\delta$ for an experiment as

$$\delta = \frac{\mu^E - \mu^C}{\sigma},$$

where $\mu^E$ and $\mu^C$ are the experimental and control group
population means and $\sigma$ is the within-group population standard
deviation. If the same experiment had been performed using a
different (linearly equatable) measure of the outcome variable,
the effect size would not change. The effect size is invariant
under linear transformations of outcome variables. Therefore the
effect size provides an index of effect magnitude that is
independent of the particular test used to measure a construct.
We emphasize that the effect size is only invariant under
substitution of (linearly equatable) measures of the same
construct. Effect sizes are not invariant under nonlinear
rescaling. Similarly, there is no reason to believe that effect
sizes derived from measures of one construct are equivalent to
effect sizes derived from measures of another construct.
Different constructs will, in general, yield different effect
sizes. Thus the notion of effect size of a treatment should
really be considered as effect size of a treatment on a
construct.
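The invariance property can be checked numerically. The sketch below uses a sample analogue of the effect size (the mean difference divided by the pooled within-group standard deviation) because population values are not available; the scores, the function name `effect_size`, and the two rescalings are hypothetical and chosen only for illustration. The linear rescaling has a positive slope; a negative slope would reverse the sign of the effect size.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(experimental, control):
    """Sample standardized mean difference using the pooled within-group SD."""
    n_e, n_c = len(experimental), len(control)
    pooled_var = ((n_e - 1) * variance(experimental)
                  + (n_c - 1) * variance(control)) / (n_e + n_c - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)


# Hypothetical scores on some outcome measure.
experimental = [52.0, 61.0, 58.0, 70.0, 66.0, 59.0]
control = [50.0, 55.0, 49.0, 62.0, 57.0, 53.0]


def linear(y):
    """A linearly equatable rescaling (positive slope)."""
    return 2.5 * y + 10.0


def nonlinear(y):
    """A nonlinear rescaling."""
    return y ** 2


d_original = effect_size(experimental, control)
d_linear = effect_size([linear(y) for y in experimental],
                       [linear(y) for y in control])
d_nonlinear = effect_size([nonlinear(y) for y in experimental],
                          [nonlinear(y) for y in control])

print(f"original scale:      {d_original:.4f}")
print(f"linear rescaling:    {d_linear:.4f}")     # same as the original
print(f"nonlinear rescaling: {d_nonlinear:.4f}")  # differs from the original
```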

Other Indices of Effect Magnitude 

Glass (1978) also observed that the product moment
correlation is a scale invariant measure of the relationship
between two continuous variables. That is, the correlation does
not change as a result of linear rescalings of variables. He
therefore suggested that correlation coefficients could be used
as indices of effect magnitude for studies examining the
relationship between two continuous variables. In some cases,
the natural index of effect magnitude is the difference between
proportions of subjects that reach a criterion in experimental
and control groups. The proportions are themselves scale
invariant and can therefore be used directly to compute a scale
invariant index of effect magnitude such as the difference
between proportions.

The Method of Meta-Analysis 

The object of meta-analysis or other methods of
quantitative research synthesis is to use data from a series of
studies to obtain information about the effect size for a
treatment on various constructs. This usually involves obtaining
an estimate of effect size from each study and pooling
(averaging) these estimates to obtain an estimate of the average
effect size across studies (Glass, 1976). In addition, the
investigator may want to determine whether any characteristics
of the studies are systematically related to effect size.

Some writers in the area of research synthesis have cited
substantive reasons for the position that different studies of
the effects of the same treatment might yield quite different
results. Light and Smith (1971) argued that many contradictions
in research evidence may be resolved by grouping studies with
similar characteristics. They asserted that studies with the
same characteristics are more likely to yield similar results,
and hence many apparent contradictions among research results
arise from differences in the characteristics of studies.
Pillemer and Light (1980) have argued that examining the
relationship of variations in study outcomes and study
characteristics is an essential step in assessing the range of
generalizability of a research finding. For example, if a
treatment produces essentially the same effect in a wide variety
of settings with a variety of people, we are more confident in
the generalizability of the finding of a treatment effect.
[...] related to effect size. The statistical analyses have sometimes
involved regressing the effect size estimates obtained from a
series of studies on variables that represent various
characteristics of studies (Glass, 1976, 1978). Such methods
have been used, for example, in the meta-analysis of studies of
the effectiveness of psychotherapy (Smith & Glass, 1977), the
effects of class size on achievement (Glass & Smith, 1979),
and the effects of television on achievement (Pascarella,
Walberg, Junker, & Haertel, 1981).



THE STATISTICAL ANALYSIS OF EFFECT SIZE DATA

Assumptions and the Statistical Model
Many treatments of effect size have not adequately
emphasized the assumptions underlying effect size estimation and
testing. Glass (1976) proposed the quantitative synthesis of the
results of a collection of experimental/control group studies by
estimating a population effect size for each study and then
combining the estimates across studies. The statistical analyses
in such studies typically involve the use of a t- or F-test to
test for differences between the groups. If the assumptions for
the validity of the t-test are met, it is possible to derive the
properties of estimators of effect size exactly. We start by
stating these assumptions explicitly.

Suppose that the data arise from a series of k independent
studies, where each study compares an experimental group (E)
with an independent control group (C). Let $Y_{ij}^E$ and $Y_{ij}^C$
be the jth scores on the ith experiment from the experimental and
control groups, respectively. Assume that for fixed i, $Y_{ij}^E$ and
$Y_{ij}^C$ are normally distributed with means $\mu_i^E$ and $\mu_i^C$
and common variance $\sigma_i^2$, that is,

$$Y_{ij}^E \sim N(\mu_i^E, \sigma_i^2), \quad j = 1, \ldots, n_i^E, \quad i = 1, \ldots, k,$$

and

$$Y_{ij}^C \sim N(\mu_i^C, \sigma_i^2), \quad j = 1, \ldots, n_i^C, \quad i = 1, \ldots, k.$$

In this notation, the effect size for the ith study ($\delta_i$) is
defined as

$$\delta_i = \frac{\mu_i^E - \mu_i^C}{\sigma_i}, \tag{1}$$

where we use the Greek letter $\delta$ to denote that this effect size
is a population parameter.

Note that the assumptions of the t-test may not always be
met in practice. They may never be exactly met. These
assumptions are often reasonably well satisfied in practice, and
the theory that follows, as well as that of the primary
statistical analyses, will be a reasonable approximation to
reality. Since the theory that follows relies on the properties
of the t-distribution, many of the results should be robust. In
some situations, however, violations of the model assumptions
will be severe. For example, the observations in each study
might have a highly skewed distribution. In cases such as these,
alternative statistical methods are necessary. Unfortunately
there has been little work on statistical procedures for
meta-analysis with nonstandard models. One exception is that
Kraemer and Andrews (1982) have provided a "nonparametric"
estimator of effect size. Additional work is needed to provide a
more complete theory for meta-analysis when standard assumptions
are not tenable. Another important issue is the quality of the
data reported in studies to be combined. The quality of the
research synthesis is unlikely to be higher than that of the
studies that go into it. This suggests that reviewers must
carefully examine the studies before an attempt is made to
combine the results of those studies.

Estimating Effect Size 
The definition of effect size given in (1) above defines a
population parameter $\delta_i$ in terms of other population parameters
$\mu_i^E$, $\mu_i^C$, and $\sigma_i$. We will seldom, if ever, know the
exact values of $\mu_i^E$, $\mu_i^C$, and $\sigma_i$, and thus we will
have to estimate $\delta_i$. Glass (1976) proposed a statistic $g'$ to
estimate $\delta_i$ by essentially replacing $\mu_i^E$, $\mu_i^C$, and
$\sigma_i$ in the definition of $\delta_i$ by their sample analogues.
Specifically Glass proposed the estimator $g_i'$ of $\delta_i$, where
$g_i'$ is defined by

$$g_i' = \frac{\bar{Y}_i^E - \bar{Y}_i^C}{S_i^C}, \quad i = 1, \ldots, k, \tag{2}$$

where $\bar{Y}_i^E$ and $\bar{Y}_i^C$ are the experimental and control
group sample means for the ith study and $S_i^C$ is the control group
sample standard deviation. Hedges (1981) has shown that under the
assumptions of the previous section, the estimator (2) is
biased. Figure 1 is a graphic representation of the ratio of the
expected value of $g'$ to the true parameter value $\delta$ as a
function of the degrees of freedom in the estimate of $\sigma_i$.
We see that the bias of $g'$ tends toward zero in studies with large
sample sizes but can be substantial in studies with small sample
sizes.

If the assumption of equal population variances in
experimental and control groups holds, a less biased estimator
results when $S_i^C$ is replaced with the usual pooled within-groups
standard deviation. We denote this estimator by $g_i''$, that is,

$$g_i'' = \frac{\bar{Y}_i^E - \bar{Y}_i^C}{S_i}, \quad i = 1, \ldots, k, \tag{3}$$

where $S_i^2$ is the pooled estimate of the variance

$$S_i^2 = \frac{(n_i^E - 1)(S_i^E)^2 + (n_i^C - 1)(S_i^C)^2}{n_i^E + n_i^C - 2}.$$

[Figure 1]


We emphasize that $g_i''$ is a sample statistic and therefore
has a sampling distribution of its own. Our assumptions imply
that $g_i''$ is distributed as $(1/\sqrt{\tilde{n}_i})$ times a
noncentral t random variable with $n_i^E + n_i^C - 2$ degrees of
freedom and noncentrality parameter $\sqrt{\tilde{n}_i}\,\delta_i$,
where $\tilde{n}_i = n_i^E n_i^C/(n_i^E + n_i^C)$. This distribution
leads immediately to exact expressions for the bias and variance
of $g_i''$, which are given in Hedges (1981). One should also note
that $g_i''$ is an inference sufficient statistic for $\delta_i$.
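As a small numerical illustration of the two estimators discussed above, the sketch below computes the estimator that standardizes by the control group standard deviation and the one that standardizes by the pooled within-groups standard deviation. The summary statistics and the function names (`glass_estimator`, `pooled_sd`, `pooled_estimator`) are hypothetical and are not taken from the monograph.

```python
from math import sqrt


def glass_estimator(mean_e, mean_c, sd_c):
    """Mean difference divided by the control group standard deviation."""
    return (mean_e - mean_c) / sd_c


def pooled_sd(sd_e, sd_c, n_e, n_c):
    """Usual pooled within-groups standard deviation."""
    pooled_var = ((n_e - 1) * sd_e ** 2 + (n_c - 1) * sd_c ** 2) / (n_e + n_c - 2)
    return sqrt(pooled_var)


def pooled_estimator(mean_e, mean_c, sd_e, sd_c, n_e, n_c):
    """Mean difference divided by the pooled within-groups standard deviation."""
    return (mean_e - mean_c) / pooled_sd(sd_e, sd_c, n_e, n_c)


if __name__ == "__main__":
    # Hypothetical summary statistics for one study.
    mean_e, mean_c = 105.0, 100.0
    sd_e, sd_c = 12.0, 10.0
    n_e, n_c = 25, 30
    print("control-SD estimator:", round(glass_estimator(mean_e, mean_c, sd_c), 4))
    print("pooled-SD estimator: ",
          round(pooled_estimator(mean_e, mean_c, sd_e, sd_c, n_e, n_c), 4))
```

When the two sample standard deviations happen to be equal, the two estimators coincide; otherwise they differ according to how far the pooled value departs from the control group value.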

An Unbiased Estimator of Effect Size 

A simple unbiased estimator of $\delta$ was obtained by Hedges
(1981) based on the assumptions of the previous section. The
unbiased estimator $g_i$ is given by

$$g_i = c(m)\, g_i'', \tag{4}$$

where $m = n_i^E + n_i^C - 2$, $c(m)$ is given exactly by

$$c(m) = \frac{\Gamma(m/2)}{\sqrt{m/2}\;\Gamma[(m - 1)/2]}, \tag{5}$$

$\Gamma(x)$ is the gamma function, and $c(m)$ is given approximately by

$$c(m) \approx 1 - \frac{3}{4m - 1}.$$

It is clear that as $m$ becomes large, $g_i''$ tends to $g_i$, so that
$g_i''$ is almost unbiased in large samples. Since $c(m) < 1$, the
variance of the unbiased estimator is always smaller than the
variance of $g_i''$. Hence $g_i$ has uniformly smaller mean squared
error than $g_i''$. The exact variance of $g_i$ is

$$\mathrm{Var}(g_i) = \frac{[c(n_i^E + n_i^C - 2)]^2\,(n_i^E + n_i^C - 2)\,(1 + \tilde{n}_i \delta_i^2)}{(n_i^E + n_i^C - 4)\,\tilde{n}_i} - \delta_i^2, \tag{6}$$

where $\tilde{n}_i = n_i^E n_i^C/(n_i^E + n_i^C)$, and $c(m)$ is given by (5).
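The correction factor and the exact variance can be evaluated directly. The following sketch is one possible implementation, assuming the gamma-function form of c(m), the common approximation 1 - 3/(4m - 1), and the exact variance expression for the unbiased estimator; the function names are hypothetical. The log-gamma function is used only to avoid overflow for large m.

```python
from math import exp, lgamma, sqrt


def c_exact(m):
    """Exact bias correction factor c(m); requires m > 1."""
    return exp(lgamma(m / 2) - lgamma((m - 1) / 2)) / sqrt(m / 2)


def c_approx(m):
    """Approximate correction factor 1 - 3/(4m - 1)."""
    return 1 - 3 / (4 * m - 1)


def exact_variance(delta, n_e, n_c):
    """Exact variance of the unbiased estimator; requires n_e + n_c > 4."""
    m = n_e + n_c - 2
    n_tilde = n_e * n_c / (n_e + n_c)
    return (c_exact(m) ** 2 * m * (1 + n_tilde * delta ** 2)
            / ((m - 2) * n_tilde)) - delta ** 2


if __name__ == "__main__":
    for m in (5, 10, 20, 50, 100):
        print(f"m = {m:3d}: exact = {c_exact(m):.5f}, approx = {c_approx(m):.5f}")
    # Variance for a hypothetical study with 20 subjects per group and delta = 0.5.
    print("exact variance:", round(exact_variance(0.5, 20, 20), 5))
```

The printed values show how close the approximation is to the exact factor even for modest degrees of freedom, and that both approach 1 as m grows.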



The Asymptotic Distribution of the Unbiased Estimator 

In small samples, the estimator $g_i$ of effect size has a
sampling distribution that is a constant times the noncentral
t-distribution. When the sample sizes in the experimental and
control groups are large, however, the asymptotic distribution
of $g_i$ provides a satisfactory approximation to the exact
distribution of $g_i$. The large sample approximation is given by

$$g_i \sim N(\delta_i, \sigma_i^2(\delta_i)), \tag{7}$$

where

$$\sigma_i^2(\delta_i) = \frac{n_i^E + n_i^C}{n_i^E n_i^C} + \frac{\delta_i^2}{2(n_i^E + n_i^C)}, \tag{8}$$

and we use the expression $\sigma_i^2(\delta_i)$ to indicate that the
variance of $g_i$ depends on the true effect size $\delta_i$. This large
sample approximation is used by substituting an estimator of the effect
size for $\delta_i$ in (8). In the case of a single effect size, we
substitute $g_i$ for $\delta_i$ in (8) to obtain an expression for the
variance of $g_i$. A useful guideline on what constitutes a large
sample is $n^E, n^C > 10$. If the sample size of either group is
smaller than about 10, it may be desirable to omit the study
from data analyses since the estimate of effect size is so
imprecise that it is almost useless.
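The large sample variance is simple to compute once the effect size estimate has been substituted for the unknown parameter. The sketch below assumes that substitution; the function name `large_sample_variance` is hypothetical, and the example values are those reported for the first and ninth studies in Table 1 of this monograph.

```python
def large_sample_variance(g, n_e, n_c):
    """Large sample variance of an effect size estimate.

    The estimate g is substituted for the unknown effect size, as the text
    describes for the case of a single effect size.
    """
    return (n_e + n_c) / (n_e * n_c) + g ** 2 / (2 * (n_e + n_c))


if __name__ == "__main__":
    # Study 1 of Table 1: n_E = 131, n_C = 138, g = .158
    print(round(large_sample_variance(0.158, 131, 138), 4))
    # Study 9 of Table 1: n_E = 20, n_C = 23, g = -.055
    print(round(large_sample_variance(-0.055, 20, 23), 4))
```

The first term depends only on the sample sizes, so for small effect sizes the variance is dominated by the sizes of the two groups.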

Testing Homogeneity of Effect Size

Before pooling estimates of effect size from a series of k
studies, it is important to ask whether the studies can
reasonably be described as sharing a common effect size. A
statistical test for the homogeneity of effect size is formally
a test of the hypothesis

$$H_0: \delta_i = \delta, \quad i = 1, \ldots, k,$$

versus the alternative that at least one $\delta_i$ differs from the
rest.

A large sample (approximate) test for the equality of k
effect sizes given by Hedges (1982a) uses the test statistic

$$H_T = \sum_{i=1}^{k} \frac{(g_i - g_\cdot)^2}{\hat{\sigma}^2(g_i)}, \tag{9}$$

where $g_\cdot$ is the weighted estimator of effect size given below in
(13).

The test statistic $H_T$ is the sum of squares of the $g_i$ about
the weighted mean $g_\cdot$, where the ith square is weighted by the
reciprocal of the estimated variance of $g_i$. The defining formula
(9) is helpful in illustrating the intuitive nature of the
statistic $H_T$, but a computational formula is more useful for
actual calculation of $H_T$. The computational formula is

$$H_T = \sum_{i=1}^{k} \frac{g_i^2}{\hat{\sigma}^2(g_i)} - \frac{\left(\sum_{i=1}^{k} g_i/\hat{\sigma}^2(g_i)\right)^2}{\sum_{i=1}^{k} 1/\hat{\sigma}^2(g_i)}, \tag{10}$$

where $\hat{\sigma}^2(g_i)$ is given by (8). A similar test is given by
Rosenthal and Rubin (1982).
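The equivalence of the defining and computational formulas can be checked on the ten studies reported in Table 1 of this monograph. The sketch below assumes that the weighted mean is the sum of the weighted estimates divided by the sum of the weights, which is the form implied by the computational formula, and that the variances come from the large sample expression with each estimate substituted for its parameter. The critical value 19.023 is the tabled upper 2.5 per cent point of the chi-square distribution with 9 degrees of freedom; variable names are hypothetical.

```python
# (n_E, n_C, g) for the ten open education studies in Table 1.
STUDIES = [
    (131, 138, 0.158), (40, 40, 0.261), (40, 40, 0.649), (79, 49, 0.503),
    (84, 45, 0.458), (78, 55, 0.577), (38, 110, 0.588), (38, 93, 0.392),
    (20, 23, -0.055), (40, 40, -0.332),
]


def large_sample_variance(g, n_e, n_c):
    """Large sample variance with g substituted for the effect size."""
    return (n_e + n_c) / (n_e * n_c) + g ** 2 / (2 * (n_e + n_c))


estimates = [g for _, _, g in STUDIES]
weights = [1 / large_sample_variance(g, n_e, n_c) for n_e, n_c, g in STUDIES]

sum_w = sum(weights)
sum_wg = sum(w * g for w, g in zip(weights, estimates))
sum_wg2 = sum(w * g ** 2 for w, g in zip(weights, estimates))

# Weighted mean (assumed form: weights are reciprocal variances).
g_dot = sum_wg / sum_w

# Defining formula: weighted sum of squares about the weighted mean.
h_defining = sum(w * (g - g_dot) ** 2 for w, g in zip(weights, estimates))

# Computational formula: uses only the three running totals.
h_computational = sum_wg2 - sum_wg ** 2 / sum_w

CHI2_CRITICAL_9DF_025 = 19.023  # tabled value, 9 degrees of freedom

print(f"defining formula:      {h_defining:.2f}")
print(f"computational formula: {h_computational:.2f}")
print("exceeds critical value:", h_computational > CHI2_CRITICAL_9DF_025)
```

The three running totals correspond to the TOTALS row of Table 1, which is why the computational formula is convenient for hand calculation.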



When each study has a large sample size (a reasonable
guideline is $n^E, n^C > 10$), the asymptotic distribution of $H_T$ can
be used as the basis for an approximate test of the homogeneity
of the $\delta_i$. If all the k studies have the same population effect
size (i.e., if $H_0$ is true) then the test statistic $H_T$ has an
asymptotic chi-square distribution given by

$$H_T \sim \chi^2_{k-1}.$$

Therefore if the obtained value of $H_T$ exceeds the $100(1 - \alpha)$ per
cent critical value of the chi-square distribution with $(k - 1)$
degrees of freedom, we reject the hypothesis that the $\delta_i$ are
equal. If this null hypothesis is rejected, a conservative
individual may decide not to pool all of the estimates of $\delta$ since
they are not estimating the same parameter. When the sample
sizes are very large, however, it is probably worthwhile to
consider the actual variation in the values of $g_i$, since rather
small differences may lead to large values of the test
statistic. If the $g_i$ values do not differ much in an absolute
sense, the investigator may elect to pool the estimates even
though there is reason to believe that the underlying parameters
are not identical. A less conservative investigator might pool
estimates regardless of the outcome of tests of homogeneity.

Assessing Variability of Effect Sizes

It is often helpful to plot the effect sizes from a series
of studies to assess the variability of the $g_i$ values. The large
sample approximation (7) may be used to obtain a confidence
interval for each effect size. An approximate $100(1 - \alpha)$ per cent
confidence interval for $\delta_i$ is given by

$$g_i - z_{\alpha/2}\,\hat{\sigma}(g_i) \le \delta_i \le g_i + z_{\alpha/2}\,\hat{\sigma}(g_i),$$

where $z_{\alpha/2}$ is the two-tailed $100\alpha$ per cent critical value
of the standard normal distribution and $\hat{\sigma}^2(g_i)$ is the large
sample variance of $g_i$ given by (8). Plotting each $g_i$ value along
with a confidence interval for each $g_i$ gives an idea of the region in
which the corresponding $\delta_i$ is likely to be. Therefore substantial
overlap of these confidence intervals suggests that there is agreement
among the $g_i$ on a common effect size. Conversely, if some of the
$g_i$ values are far from the rest, and their associated confidence
intervals do not overlap much, then it may be useful to consider
these deviant values as outliers. If there are only a few
outlying values then it may be helpful to treat these studies
separately, and estimate a common effect size from the other
studies.
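The interval calculation can be sketched for the ten studies reported in Table 1 of this monograph. The code assumes the large sample variance with each estimate substituted for its parameter and a two-tailed normal critical value; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# (n_E, n_C, g) for the ten open education studies in Table 1.
STUDIES = [
    (131, 138, 0.158), (40, 40, 0.261), (40, 40, 0.649), (79, 49, 0.503),
    (84, 45, 0.458), (78, 55, 0.577), (38, 110, 0.588), (38, 93, 0.392),
    (20, 23, -0.055), (40, 40, -0.332),
]


def large_sample_variance(g, n_e, n_c):
    """Large sample variance with g substituted for the effect size."""
    return (n_e + n_c) / (n_e * n_c) + g ** 2 / (2 * (n_e + n_c))


def confidence_interval(g, n_e, n_c, alpha=0.05):
    """Approximate 100(1 - alpha) per cent interval for one effect size."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    half_width = z * sqrt(large_sample_variance(g, n_e, n_c))
    return g - half_width, g + half_width


intervals = [confidence_interval(g, n_e, n_c) for n_e, n_c, g in STUDIES]

for number, (lower, upper) in enumerate(intervals, start=1):
    print(f"Study {number:2d}: [{lower:6.3f}, {upper:6.3f}]")
```

Listing the intervals side by side gives the same information as a plot: studies whose intervals sit apart from the others are candidates for separate treatment.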

The effect sizes from 10 studies of the effect of open
education on attitude toward school are presented in Table 1
along with sample sizes and the estimated sampling variance
$\hat\sigma^2(g_i)$ for each study. The 95 per cent confidence
intervals for these effect sizes are plotted in Figure 2. We see
that one effect size, that of study 10, differs markedly from
the rest. Similarly, the confidence interval for $\delta_{10}$ fails
to overlap with those of other studies. Calculations for the
test of homogeneity are also given in Table 1, and we see that
the value of the homogeneity statistic is $H_T = 19.40$, which is




Table 1

Effect Sizes from 10 Studies of the Effects of
Open Education on Student Attitude toward School

Study    n^E    n^C      g      σ²(g)    1/σ²(g)    g/σ²(g)    g²/σ²(g)

  1      131    138     .158    .0149     66.996     10.585      1.672
  2       40     40     .261    .0504     19.831      5.176      1.351
  3       40     40     .649    .0526     19.000     12.331      8.003
  4       79     49     .503    .0341     29.365     14.770      7.429
  5       84     45     .458    .0349     28.620     13.108      6.004
  6       78     55     .577    .0322     31.004     17.889     10.322
  7       38    110     .588    .0366     27.341     16.077      9.453
  8       38     93     .392    .0376     26.557     10.410      4.081
  9       20     23    -.055    .0935     10.694     -0.588      0.032
 10       40     40    -.332    .0507     19.728     -6.550      2.175

TOTALS                                   279.135     93.209     50.522






[Figure 2: one horizontal confidence interval plotted for each of
Study 1 through Study 10; horizontal axis: Effect Size.]

Figure 2. Ninety-five per cent confidence intervals for
effect sizes for the ten studies described in Table 1.




significant beyond the $\alpha = .025$ level. Deleting the effect size
for study 10, we see that the remaining effect sizes are reasonably
homogeneous: $H_T = 9.983$ on 8 degrees of freedom, $p > .10$.
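The calculations in Table 1 can be retraced with a short script. The following is only a sketch: the variance function is the expression that reproduces the $\sigma^2(g)$ column of Table 1 and is assumed here to be what (8) denotes, the negative signs for studies 9 and 10 are taken from the column totals, and all function names are illustrative.

```python
import math

# (n_E, n_C, g) for the ten studies of Table 1. The negative signs for
# studies 9 and 10 are inferred from the column totals (an assumption).
studies = [
    (131, 138, 0.158), (40, 40, 0.261), (40, 40, 0.649), (79, 49, 0.503),
    (84, 45, 0.458), (78, 55, 0.577), (38, 110, 0.588), (38, 93, 0.392),
    (20, 23, -0.055), (40, 40, -0.332),
]

def variance(n_e, n_c, g):
    """Large-sample variance of g (assumed form of expression (8))."""
    return (n_e + n_c) / (n_e * n_c) + g * g / (2.0 * (n_e + n_c))

def confidence_interval(n_e, n_c, g, z=1.96):
    """Approximate confidence interval for one study's effect size."""
    half_width = z * math.sqrt(variance(n_e, n_c, g))
    return g - half_width, g + half_width

def homogeneity(data):
    """Fit statistic: sum(g^2/v) - (sum(g/v))^2 / sum(1/v)."""
    weights = [1.0 / variance(*s) for s in data]
    sum_w = sum(weights)
    sum_wg = sum(w * s[2] for w, s in zip(weights, data))
    sum_wg2 = sum(w * s[2] ** 2 for w, s in zip(weights, data))
    return sum_wg2 - sum_wg ** 2 / sum_w

h_all = homogeneity(studies)              # all ten studies
h_without_10 = homogeneity(studies[:-1])  # study 10 deleted

for number, study in enumerate(studies, start=1):
    low, high = confidence_interval(*study)
    print(f"Study {number:2d}: g = {study[2]:6.3f}, 95% CI = ({low:6.3f}, {high:6.3f})")
print(f"H_T (10 studies) = {h_all:.2f}; H_T (study 10 deleted) = {h_without_10:.3f}")
```

The two fit statistics should agree with the values quoted in the text to within rounding, since the table carries only three or four decimal places.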

Estimation of Effect Size from a Series of Homogeneous Studies 

If a series of k independent studies share a common effect
size $\delta$, it is natural to estimate $\delta$ by pooling estimates
from each of the studies. If the sample sizes of the studies differ,
then the estimates from some (the larger) studies will be more
precise than the estimates from other (smaller) studies. In
this case it is reasonable to give more weight to the more
precise estimates when pooling. This leads to weighted
estimators of the form

$$ \sum_{i=1}^{k} w_i g_i, \tag{11} $$

where $w_i > 0$, $i = 1, \ldots, k$, and $\sum_{i=1}^{k} w_i = 1$. It is
easy to show that the weights that minimize the variance of (11) are
given by

$$ w_i = \frac{1/v_i}{\sum_{j=1}^{k} 1/v_j}, \quad i = 1, \ldots, k, \tag{12} $$

where $v_i$ is the variance of $g_i$ given in (6). The practical
problem in calculating the most precise weighted estimate is
that the i-th weight depends on the variance of $g_i$, which in turn
depends on $\delta$.
One approach to the problem of weighting results from
different studies is to use weights that are based on some
approximation to the $v_i$ that does not depend on $\delta$. This
procedure results in a pooled estimator that is unbiased, but it
will usually be less precise than if the optimal weights are
used. For example, weights could be derived by assuming that

$$ v_i = \frac{[c(n_i^E + n_i^C - 2)]^2 \, (n_i^E + n_i^C - 2)}{\tilde{n}_i \, (n_i^E + n_i^C - 4)}, $$

where $\tilde{n}_i = n_i^E n_i^C / (n_i^E + n_i^C)$. The weights thus
derived are only optimal if $\delta = 0$. If $\delta$ is near zero these
weights will be close to optimal since $v_i$ depends on $\delta^2$,
which will be small. If a nonzero a priori estimate of $\delta$ is
available, then weights could be estimated by inserting that value
of $\delta$ in expression (6) for the variance of $g_i$ and using the
formula (12) for $w_i$. In general the result will be an unbiased
pooled estimator of $\delta$ that is slightly less precise than the
most precise weighted estimator.

Another approach to obtaining a weighted estimator of $\delta$ is
to use the sample estimate of $\delta$ from each study to estimate
the weight for that study. Define the weighted estimator $g_\cdot$ by

$$ g_\cdot = \frac{\sum_{i=1}^{k} g_i / \hat\sigma^2(g_i)}{\sum_{i=1}^{k} 1 / \hat\sigma^2(g_i)}, \tag{13} $$

where $\hat\sigma^2(g_i)$ is given by (8). The estimator $g_\cdot$ is
therefore obtained by calculating the weights using $g_i$ for
$\delta_i$ in (8). Although the $g_i$ are unbiased, $g_\cdot$ is not.
The bias of $g_\cdot$ is small in large samples and tends to zero as
the sample sizes tend to infinity.

This estimator could be modified by replacing $g_i$ by $g_\cdot$ in
the expression for $\hat\sigma^2(g_i)$, and iterating. That is,
calculate the estimator $g_\cdot^{(1)}$ defined by

$$ g_\cdot^{(1)} = \frac{\sum_{i=1}^{k} g_i / \sigma_i^2(g_\cdot)}{\sum_{i=1}^{k} 1 / \sigma_i^2(g_\cdot)}, \tag{14} $$

where $\sigma_i^2(\delta)$ is given by (8), evaluated here at
$\delta = g_\cdot$. The iterated estimator $g_\cdot^{(1)}$ will tend to
be less biased than $g_\cdot$. If the effect size is homogeneous
across experiments, the iteration process usually will not change
the estimate very much.

The asymptotic distribution of $g_\cdot$ is easily obtained and
can be used to obtain large sample confidence intervals for $\delta$
based on $g_\cdot$. The formal definition of "large sample" in this
case is that the sample sizes $n_i^E$ and $n_i^C$, $i = 1, \ldots, k$,
are tending to infinity at the same rate. A practical guideline for
"large sample" is $n_i^E, n_i^C > 10$. The large sample approximation
is

$$ g_\cdot \sim N(\delta, \sigma_\cdot^2(\delta)), \tag{15} $$

where

$$ \sigma_\cdot^2(\delta) = \left[ \sum_{i=1}^{k} \frac{1}{\sigma_i^2(\delta)} \right]^{-1} \tag{16} $$

and $\sigma_i^2(\delta)$ is given by (8). We use this large sample
approximation by substituting the (consistent) estimator $g_\cdot$ for
$\delta$ in (15). A 100(1 - $\alpha$) per cent asymptotic confidence
interval for $\delta$ is therefore

$$ g_\cdot - z_{\alpha/2}\,\sigma_\cdot(g_\cdot) \le \delta \le g_\cdot + z_{\alpha/2}\,\sigma_\cdot(g_\cdot), $$

where $z_{\alpha/2}$ is obtained from a table of the standard normal
distribution. Similarly, an asymptotic test of the hypothesis
that $\delta = 0$ uses the test statistic

$$ z(g_\cdot) = \frac{g_\cdot}{\sigma_\cdot(g_\cdot)}. \tag{17} $$

If the obtained value of $z(g_\cdot)$ is larger in absolute value than
the 100(1 - $\alpha$/2) per cent critical value of the standard normal
distribution, we reject the hypothesis that $\delta = 0$ at the
100$\alpha$ per cent significance level.

The formal asymptotic distribution of the iterated
estimator $g_\cdot^{(1)}$ is the same as that of $g_\cdot$. We use the
large sample approximation to the distribution of $g_\cdot^{(1)}$ by
substituting $g_\cdot^{(1)}$ for $\delta$ in (16). Therefore confidence
intervals and significance tests for $\delta$ based on $g_\cdot^{(1)}$
are calculated in the same way as for $g_\cdot$. The only difference
when using $g_\cdot^{(1)}$ is that $g_\cdot$ is replaced by
$g_\cdot^{(1)}$ wherever the former occurs.
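The pooled estimator, its one-step iterate, and the associated confidence interval and test can be sketched as follows. This is an illustration under stated assumptions, not the author's program: the variance function is the form assumed for (8), the data are studies 1 through 9 of Table 1 (study 10 set aside, with the sign of study 9 inferred from the table totals), and the function names are hypothetical.

```python
import math

# (n_E, n_C, g) for studies 1-9 of Table 1; study 10 is set aside as an outlier.
studies = [
    (131, 138, 0.158), (40, 40, 0.261), (40, 40, 0.649), (79, 49, 0.503),
    (84, 45, 0.458), (78, 55, 0.577), (38, 110, 0.588), (38, 93, 0.392),
    (20, 23, -0.055),
]

def variance(n_e, n_c, delta):
    """Large-sample variance of g at effect size delta (assumed form of (8))."""
    return (n_e + n_c) / (n_e * n_c) + delta * delta / (2.0 * (n_e + n_c))

def pooled(data, plug_in=None):
    """Weighted mean of the g_i and the reciprocal of the summed weights.

    With plug_in=None each weight uses the study's own g_i, as in (13).
    With a number, every weight is evaluated at that common value, as in (14).
    """
    weights = [
        1.0 / variance(n_e, n_c, g if plug_in is None else plug_in)
        for n_e, n_c, g in data
    ]
    total = sum(weights)
    estimate = sum(w * s[2] for w, s in zip(weights, data)) / total
    return estimate, 1.0 / total

g_dot, _ = pooled(studies)                              # estimator (13)
g_iter, var_at_g_dot = pooled(studies, plug_in=g_dot)   # (14), and (16) at g_dot

se = math.sqrt(var_at_g_dot)
ci = (g_dot - 1.96 * se, g_dot + 1.96 * se)   # 95 per cent interval for delta
z = g_dot / se                                # test statistic (17)

print(f"g. = {g_dot:.4f}, iterated = {g_iter:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), z = {z:.2f}")
```

For these nine studies the iterated value differs from $g_\cdot$ only in the third decimal place, in line with the remark that iteration changes little when the effect sizes are homogeneous.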



Efficiency of the Weighted Estimator 

The weighted estimators discussed in previous sections were
derived by finding the expression for weights that minimize the
variance of the resulting weighted estimator. One might ask
whether the best (most precise) weighted estimator is the most
precise in some larger class of estimators of effect size,
including those that are not weighted linear combinations of the
$g_i$. Hedges (1982a) showed that $g_\cdot$ is asymptotically efficient
in the sense that the asymptotic variance of $g_\cdot$ is the
theoretical minimum (Cramér-Rao bound). Thus no other consistent
estimator has smaller asymptotic variance. This result implies
that $g_\cdot$ has the same asymptotic distribution as the maximum
likelihood estimator of $\delta$ based on k experiments.



An Analogue to the Analysis of Variance for Effect Sizes

The representation of the results of a collection of
studies by a single estimate of effect magnitude can be
misleading if the underlying (population) effect sizes are not
identical in all of the studies. For example, suppose a
treatment produces large positive (population) effects in
one-half of a collection of studies, and large negative
(population) effects in the other half of a collection of
studies. Then representation of the overall effect of the
treatment as zero is misleading, because all of the studies
actually have underlying effects that are different from zero.
The test for homogeneity of effect size given in (9) provides a
method for detecting heterogeneity of effect sizes. It will
often be the case that a collection of studies cannot be
reasonably said to share the same effect size. For example,
Giaconia and Hedges (1982) report the results of tests of
homogeneity for studies measuring the effect of open education
on 19 different dependent variables. For each dependent
variable, the hypothesis of homogeneity of effect size was
easily rejected.

Some investigators in quantitative research synthesis
(e.g., Kulik, Kulik, & Cohen, 1979) have recognized the
potential for heterogeneous effect sizes and have grouped
studies which share common characteristics into classes. The
usual approach is then to treat the effect size estimates as
data and calculate an analysis of variance to determine if these
classes have different mean effect sizes. There are two
problems with this procedure. First, the assumptions of the
analysis of variance may not be met since the effect size
estimates may not have the same distribution within cells. The
variance of an individual observation (effect size estimate) is
proportional to 1/n, where n is the number of subjects in the
study. When studies have different sample sizes, the individual
"error" variances may differ by a factor of 10 or 20. Secondly,
even if the between-classes test were accurate, the use of ANOVA
does not provide any indication whether or not studies within
the classes share a common effect size. Thus, even if ANOVA
correctly detects that two classes of studies have a different
average effect size, there is no guarantee that the average
effect size within each class is a reflection of a common
underlying effect size for that class.

Hedges (1982b) presented an alternative technique for
fitting models to effect sizes from a series of studies. We
assume the investigator has an a priori grouping of studies,
that is, a scheme for classifying studies that are likely to
produce similar results. This will often take the form of a set
of categories into which studies may be placed. Studies may be
cross classified by two or more sets of categories. The
technique presented in this section is straightforward.
Conceptually the investigator begins by asking whether all
studies (regardless of category) share a common effect size. A
statistical test (fit statistic) is provided by the test of
homogeneity given in (9). If the hypothesis of fit to a single
effect size is rejected, the experimenter then breaks the series
of studies into classes, and asks whether the model of a
different effect size for each class fits the data. It is
interesting to note that the fit statistic calculated at the
first stage is partitioned into stochastically independent parts
corresponding to between-class and within-class fit,
respectively. The between-class fit is an index of the extent to
which effect sizes in the classes are different. If the
within-class fit (fit to a single effect size within each class)
is not rejected, the investigator may stop. If the within-class
fit is rejected, the investigator may want to further subdivide
the classes. The process of subdividing and testing for between-
and within-class fit continues until an acceptable level of
within-class homogeneity is achieved. The procedure provides
valid asymptotic tests for the effects of classifications as
well as an indication that the final classes are internally
homogeneous with respect to effect size.




Testing Homogeneity across Classes 

Suppose that the entire collection of studies is divided
into p a priori classes. The test for homogeneity across
classes is essentially a test that the average effect size in
each class is the same as the average effect size in every other
class. Hedges (1982b) gave the test statistic $H_B$ to test for
homogeneity of effect size across classes. The statistic $H_B$ is
given by

$$ H_B = \sum_{j=1}^{p} \sum_{i \in I_j} \frac{(g_{j\cdot} - g_{\cdot\cdot})^2}{\hat\sigma^2(g_i)}, \tag{18} $$

where $\sum_{i \in I_j}$ is the sum over all studies with subscript in
the j-th class, $g_{j\cdot}$ is the weighted average effect size for
the j-th class given by

$$ g_{j\cdot} = \frac{\sum_{i \in I_j} g_i / \hat\sigma^2(g_i)}{\sum_{i \in I_j} 1 / \hat\sigma^2(g_i)}, \tag{19} $$

$g_{\cdot\cdot}$ is the weighted average effect size based on all of
the studies given by (13) or alternatively

$$ g_{\cdot\cdot} = \frac{\sum_{j=1}^{p} \sum_{i \in I_j} g_i / \hat\sigma^2(g_i)}{\sum_{j=1}^{p} \sum_{i \in I_j} 1 / \hat\sigma^2(g_i)}, \tag{20} $$

and $\hat\sigma^2(g_i)$ is given in (8).

If the effect sizes are identical in each class, then the
test statistic $H_B$ given in (18) has an asymptotic distribution
given by

$$ H_B \sim \chi^2_{p-1}. \tag{21} $$

Therefore the test of homogeneity of effect size across classes
at a significance level $\alpha$ consists of comparing the obtained
value of $H_B$ with the 100(1 - $\alpha$) per cent critical value of the
chi-square distribution with (p - 1) degrees of freedom. If $H_B$
is greater than the critical value, we reject the hypothesis of
homogeneity of effect size across classes.

Testing Homogeneity of Effect Sizes within Classes

The test of homogeneity of effect size within classes is a
test whether all of the effect sizes within the same class share
a common effect size. Hedges (1982b) gave the test statistic $H_W$
for testing the homogeneity of effect size within classes. This
test statistic is the sum of the test statistics for the
homogeneity of effect size within each class. Thus the
statistic $H_W$ is given by

$$ H_W = \sum_{j=1}^{p} \sum_{i \in I_j} \frac{(g_i - g_{j\cdot})^2}{\hat\sigma^2(g_i)}, \tag{22} $$

where $\sum_{i \in I_j}$ and $g_{j\cdot}$ are defined as in (19) and
$\hat\sigma^2(g_i)$ is given in (8). Alternatively we could calculate
$H_W$ as

$$ H_W = \sum_{j=1}^{p} H_{Wj}, $$

where

$$ H_{Wj} = \sum_{i \in I_j} \frac{(g_i - g_{j\cdot})^2}{\hat\sigma^2(g_i)}. $$

If the effect sizes within each class are homogeneous, then
$H_W$ has an asymptotic distribution given by

$$ H_W \sim \chi^2_{k-p}. \tag{23} $$

Therefore the test for homogeneity of effect sizes within
classes consists of comparing the obtained value of $H_W$ with the
100(1 - $\alpha$) per cent critical value of the chi-square
distribution on (k - p) degrees of freedom. If the obtained
value of $H_W$ exceeds the critical value we reject the hypothesis
that the effect sizes are homogeneous within classes. In data
analyses, it may be helpful to calculate the within-class fit
statistics $H_{Wj}$ for each of the p classes. This may facilitate the
identification of classes in which the fit is particularly bad.

An Analogy to the Analysis of Variance 

There is a simple relationship among the fit statistics $H_T$,
$H_B$, and $H_W$ that is analogous to the partitioning of sums of
squares in the analysis of variance. It is possible to show
that

$$ H_T = H_B + H_W $$

using only elementary algebra. One interpretation of this
formula involves the partitioning of the fit statistic $H_T$. The
"total fit" to the model of a single effect size is represented
by $H_T$. The "between-class fit" is represented by $H_B$ and the
"within-class fit" is represented by $H_W$. Thus the total fit is
partitioned into between-class and within-class components. We
have stated that the statistics $H_T$, $H_B$, and $H_W$ are distributed
asymptotically as central chi-squares under appropriate null
hypotheses with distributions given by

$$ H_T \sim \chi^2_{k-1}, \qquad H_B \sim \chi^2_{p-1}, \qquad H_W \sim \chi^2_{k-p}. $$

Furthermore, Hedges (1982b) has shown that $H_B$ and $H_W$ are
asymptotically independent. Therefore the tests for between-
and within-class fit are asymptotically independent.



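Computational Formulas for $H_T$, $H_B$, and $H_W$

In practice, computational formulas can simplify
calculation of the fit statistics $H_T$, $H_B$, and $H_W$. These
formulas are much like the computational formulas in the
analysis of variance. The computational formulas permit the
researcher to compute each of the fit statistics in a single
pass through the data with a packaged computer program. Each of
the formulas can be verified by direct algebraic manipulation.
The computational formula for $H_T$ is given in (10), but is
repeated here in different notation for reference.

$$ H_T = \sum_{j=1}^{p} \sum_{i \in I_j} \frac{g_i^2}{\hat\sigma^2(g_i)} - \frac{\left( \sum_{j=1}^{p} \sum_{i \in I_j} g_i / \hat\sigma^2(g_i) \right)^2}{\sum_{j=1}^{p} \sum_{i \in I_j} 1 / \hat\sigma^2(g_i)}, $$

$$ H_{Wj} = \sum_{i \in I_j} \frac{g_i^2}{\hat\sigma^2(g_i)} - \frac{\left( \sum_{i \in I_j} g_i / \hat\sigma^2(g_i) \right)^2}{\sum_{i \in I_j} 1 / \hat\sigma^2(g_i)}, \quad j = 1, \ldots, p, $$

$$ H_W = \sum_{j=1}^{p} H_{Wj}, $$

where $\sum_{i \in I_j}$ is defined as in (18) and $\hat\sigma^2(g_i)$ is
given in (8). Because $H_T = H_B + H_W$, the between-class statistic
then follows by subtraction, $H_B = H_T - H_W$.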
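The partition of the total fit can be checked numerically. The sketch below uses the effect sizes and variances of Table 1, but the assignment of studies to classes "A" and "B" is invented purely for illustration, as are the function names. $H_T$ and the $H_{Wj}$ come from the computational formulas, $H_B$ from definition (18).

```python
# (g, variance of g, class label). The g and variance values follow Table 1;
# the class labels are hypothetical and serve only to illustrate the partition.
data = [
    (0.158, 0.0149, "A"), (0.261, 0.0504, "A"), (0.649, 0.0526, "A"),
    (0.503, 0.0341, "A"), (0.458, 0.0349, "A"), (0.577, 0.0322, "B"),
    (0.588, 0.0366, "B"), (0.392, 0.0376, "B"), (-0.055, 0.0935, "B"),
    (-0.332, 0.0507, "B"),
]

def fit_statistic(rows):
    """Computational formula: sum(g^2/v) - (sum(g/v))^2 / sum(1/v)."""
    sum_w = sum(1.0 / v for _, v, _ in rows)
    sum_wg = sum(g / v for g, v, _ in rows)
    sum_wg2 = sum(g * g / v for g, v, _ in rows)
    return sum_wg2 - sum_wg * sum_wg / sum_w

def weighted_mean(rows):
    """Weighted average effect size with weights 1/v, as in (19) and (20)."""
    return sum(g / v for g, v, _ in rows) / sum(1.0 / v for _, v, _ in rows)

labels = sorted({label for _, _, label in data})
by_class = {label: [r for r in data if r[2] == label] for label in labels}

h_total = fit_statistic(data)
h_within_by_class = {label: fit_statistic(rows) for label, rows in by_class.items()}
h_within = sum(h_within_by_class.values())

# Between-class fit computed directly from definition (18).
grand_mean = weighted_mean(data)
h_between = sum(
    (weighted_mean(rows) - grand_mean) ** 2 / v
    for rows in by_class.values()
    for _, v, _ in rows
)

print(f"H_T = {h_total:.3f}, H_B = {h_between:.3f}, H_W = {h_within:.3f}")
print({label: round(h, 3) for label, h in h_within_by_class.items()})
```

Printing the per-class values $H_{Wj}$ alongside the totals makes it easy to see which class contributes most to the within-class misfit.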



Fitting Effect Size Models to a Series of Studies 

The statistical results of this paper can be used as part
of a general strategy for fitting models to the effect sizes
from a series of studies. Start with a series of studies where
each study assesses the effect of a particular treatment via a
two group experimental group/control group design. Suppose that
the dependent variables measure the same construct and are
(approximately) linearly equatable. We assume that the studies
are classified according to one of the classification
dimensions. The classes obtained by one partitioning may be
further partitioned according to a second classification
dimension, and in turn partitioned according to other
dimensions.

One strategy for fitting models to effect sizes for each
class is analogous to the strategy used to fit hierarchical
log-linear models to contingency tables. The strategy can be
described as follows.

Step 1. Ignore the classifications and fit the model of a
single effect size to all the studies. The estimate of this
single effect size is $g_{\cdot\cdot}$ given by (20). Calculate the fit
statistic $H_T$. If the value of $H_T$ is not large or is
statistically insignificant at some preset $\alpha$ level, the
investigator may stop, concluding that the model of a single
effect size fits the data adequately. The asymptotic
distribution of $g_{\cdot\cdot}$ may be used to calculate an asymptotic
confidence interval for $\delta$. If the fit statistic is large or
statistically significant, go on to Step 2.

Step 2. A large value of the fit statistic $H_T$ indicates
that effect sizes are not homogeneous across all studies, so
partition the studies into classes along one dimension. One
should choose the most important dimension first, that is, the
dimension believed to be most related to effect size.
Calculate the between-class fit statistic $H_B$ and the
within-class fit statistic $H_W$. If the value of the within-class
fit statistic $H_W$ is small or is statistically insignificant, the
investigator may stop, since the model of a different effect
size for each class is consistent with the data. In this case,
$g_{j\cdot}$ given in (19) is the estimate of effect size for the j-th
class and $H_B$ represents the extent to which the effect sizes
differ among classes. If $H_W$ is large or statistically significant,
then go on to Step 3.

Step 3. A large value of the fit statistic $H_W$ indicates
that effect sizes are not homogeneous within classes. At this
point it may be useful to partition within-class fit into p
(if there are p classes) statistics $H_{Wj}$, $j = 1, \ldots, p$, where
$H_{Wj}$ indicates the fit within the j-th class. Examining the values
of $H_{Wj}$ may help identify classes with especially poor fit, that
is, classes in which the effect sizes are heterogeneous. This may
lead the investigator to exclude some classes or studies from
further analyses. Examination of within-class fit may also
suggest which other classification dimensions are useful. Go on
to Step 4.

Step 4. Partition the existing classes according to a
second classification dimension. Repeat Step 2, that is,
calculate the between- and within-class fit statistics $H_B$ and
$H_W$. Proceed through Steps 2, 3, and 4 until an acceptable level of
within-class fit is obtained or the classification dimensions
are exhausted.

The procedure given is a practical method involving
relatively simple calculations. It has the advantage that fit to
the model can be assessed at each stage and it also provides a
test of the relationship between the classification dimension
and effect size.

Comparisons between Classes 

If a priori knowledge or a formal hypothesis test
(significant value of $H_B$) leads an investigator to believe that
the effect sizes are not homogeneous across classes, the
investigator may wish to compare the effect sizes of different
classes. More generally, the investigator may wish to test
hypotheses about linear combinations of the effect sizes for the
classes. Such comparisons are analogous to contrasts in the
analysis of variance.

The general comparison is a linear combination of the $g_{j\cdot}$
of the form

$$ C_g = \sum_{j=1}^{p} c_j g_{j\cdot}, \tag{24} $$

where the $c_j$, $j = 1, \ldots, p$, are known constants. In the case
of a comparison between two classes, for example, one of the $c_j$
might be +1, another might be -1, while the remainder might be zero.
The comparison $C_g$ given in (24) may be considered an estimate of

$$ C_\delta = \sum_{j=1}^{p} c_j \delta_{j\cdot}, \tag{25} $$

where $\delta_{j\cdot}$ is the weighted average population effect size
in the j-th class given by

$$ \delta_{j\cdot} = \frac{\sum_{i \in I_j} \delta_i / \sigma_i^2(\delta_i)}{\sum_{i \in I_j} 1 / \sigma_i^2(\delta_i)}. \tag{26} $$

Such comparisons are easiest to interpret when effect sizes are
homogeneous within classes since then $\delta_{j\cdot}$ is simply the
(common) effect size for the studies in the j-th class.

Hedges (1982b) used the asymptotic distribution of $g_{j\cdot}$ to
obtain the large sample approximation to the distribution of $C_g$,
specifically

$$ C_g \sim N(C_\delta, \sigma_C^2), $$

where $\sigma_C^2$ is estimated by

$$ \hat\sigma_C^2 = \sum_{j=1}^{p} \frac{c_j^2}{\sum_{i \in I_j} 1 / \hat\sigma^2(g_i)}. \tag{27} $$

Therefore an approximate 100(1 - $\alpha$) per cent confidence interval
for $C_\delta$ is given by

$$ C_g - z_{\alpha/2}\,\hat\sigma_C \le C_\delta \le C_g + z_{\alpha/2}\,\hat\sigma_C. $$
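A comparison between classes can be computed in a few lines. The sketch below is illustrative only: the effect sizes, variances, and function names are hypothetical, and each class mean is taken to have estimated variance equal to the reciprocal of the summed weights, as in (27).

```python
import math

def class_mean_and_variance(effects, variances):
    """Weighted class mean (19) and its estimated large-sample variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * g for w, g in zip(weights, effects)) / total
    return mean, 1.0 / total

def contrast(classes, coefficients, z=1.96):
    """Comparison (24), its estimated variance (27), and a confidence interval."""
    estimate = 0.0
    variance = 0.0
    for (effects, variances), c in zip(classes, coefficients):
        mean, var_mean = class_mean_and_variance(effects, variances)
        estimate += c * mean
        variance += c * c * var_mean
    half_width = z * math.sqrt(variance)
    return estimate, variance, (estimate - half_width, estimate + half_width)

# Hypothetical data: two classes, each given as (effect sizes, variances).
classes = [([0.5, 0.3], [0.04, 0.04]), ([0.1], [0.05])]
c_hat, var_c, interval = contrast(classes, [1.0, -1.0])
print(f"C = {c_hat:.3f}, variance = {var_c:.3f}, "
      f"95% CI = ({interval[0]:.3f}, {interval[1]:.3f})")
```

With coefficients +1 and -1 the comparison is simply the difference between the two class means; an interval that contains zero gives no evidence that the classes differ.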



An Analogue to Multiple Regression for Effect Sizes

When effect sizes are heterogeneous across a series of
studies, one strategy is to relate discrete characteristics of
studies to effect size, perhaps by using the method given in
previous sections. Another procedure is the application of
regression analysis to the estimates of effect size. Glass
(1978) recommended the general strategy of coding the
characteristics of studies as a vector of predictor variables
and then regressing the effect size estimate on the predictors
to determine the relationship between characteristics of studies
and effect size. For example, Smith and Glass (1977) used linear
regression to determine the relationship between several coded
characteristics of studies (e.g., type of therapy, duration of
treatment, internal validity of the study) and the effect size
in their meta-analysis of psychotherapy outcome studies. The
same method has been used in many research syntheses, including
a series of meta-analyses conducted by Walberg and his
associates (e.g., Uguroglu & Walberg, 1979; Pascarella, Walberg,
Junker, & Haertel, 1981). This strategy has been used in some
very novel and creative ways in some research syntheses. The
potential of multiple regression methods in research synthesis
is perhaps best illustrated by the meta-analyses of the effects
of class size (Glass & Smith, 1979; Smith & Glass, 1980).

Although the regression method advocated by Glass is
appealing, there are at least two problems with the method.
First, the assumptions of regression analysis are not met since
the variances of the individual effect size estimates are
proportional to 1/n, where n is the sample size of the study.
Thus when the studies to be integrated have different sample
sizes, the individual "error" variances may be dramatically
different. Secondly, even if the regression coefficients are
properly estimated, Glass's method gives no indication of the
goodness of fit of the regression model. That is, there is no
indication that the model is correctly specified.

Hedges (1982c) developed alternative methods for fitting
models to effect size data when those models include continuous
or discrete independent variables. These methods provide
consistent, asymptotically efficient estimates of the parameters
of the model and also permit large sample tests of significance.
In addition the methods can provide an explicit test of the
specification of the model. Thus it is possible to test whether
or not a model adequately explains the observed variability in
effect size estimates.

In this analysis, assume that the standardized mean
difference $\delta_i$ for the i-th experiment depends on a vector of p
fixed concomitant variables $(x_{i1}, x_{i2}, \ldots, x_{ip})$, where
$p \le k$. The vectors $(x_{i1}, \ldots, x_{ip})'$, $i = 1, \ldots, k$, are
denoted $x_i$, and the matrix

$$ X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{kp} \end{pmatrix} $$

is assumed to have rank p. The assumption that X has rank p
simply assures that none of the column vectors of X is linearly
redundant. The vector $(\beta_1, \ldots, \beta_p)'$ of regression
coefficients is denoted $\beta$. Thus the standardized mean difference
for the i-th experiment is
$\delta_i = x_i'\beta = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p$.
Denoting the vector of effect sizes by $\delta$, i.e.,
$\delta' = (\delta_1, \ldots, \delta_k)$, we can write the model for the
effect sizes as

$$ \delta = X\beta. \tag{28} $$

Denote the vector of effect size estimates by
$g = (g_1, \ldots, g_k)'$.



Estimation of $\beta$

A model for the estimator $g$ could be rewritten using $g$, $X$,
$\beta$, and a residual vector $\eta$ as

$$ g = \delta + \eta = X\beta + \eta, $$

where $\eta$ has the same distribution as $(g - \delta)$, i.e., in
large samples,

$$ \eta \sim N(0, \Sigma_\delta), $$

where $\Sigma_\delta = \mathrm{diag}(\sigma_1^2(\delta_1), \ldots, \sigma_k^2(\delta_k))$
and $\sigma_i^2(\delta_i)$ is given by (8).
If the values of $\sigma_i^2(\delta_i)$ were known, we could use
generalized least squares to obtain an estimator of $\beta$.
Unfortunately $\Sigma_\delta$ depends on $\delta$, which is unknown.
However, it is still possible to obtain estimates of $\beta$ by using
an estimated covariance matrix. Hedges (1982c) showed that the
resulting estimator can be easily computed and has the same
asymptotic distribution as the maximum likelihood estimator of
$\beta$. Therefore the alternative estimator is consistent and
asymptotically efficient. This alternative estimator is also much
easier to compute than is the maximum likelihood estimator.

Define the matrix $V(g)$ as

$$ V(g) = \mathrm{diag}(\hat\sigma^2(g_1), \ldots, \hat\sigma^2(g_k)), $$

where $\hat\sigma^2(g_i)$ is given by (8). An estimator $\hat\beta$ of
$\beta$ under model (28) is given by

$$ \hat\beta = (X'V^{-1}(g)X)^{-1} X'V^{-1}(g)\,g. \tag{29} $$



The large sample approximation to the distribution of $\hat\beta$ is
given by

$$ \hat\beta \sim N_p(\beta, \Sigma), \tag{30} $$

where $\Sigma^{-1} = (\sigma^{st})$, $s, t = 1, \ldots, p$, and
$\sigma^{st}$ is estimated by

$$ \hat\sigma^{st} = \sum_{i=1}^{k} \frac{x_{is}\,x_{it}}{\hat\sigma^2(g_i)} $$

and $\hat\sigma^2(g_i)$ is given by (8). Alternatively, $\Sigma$ is
estimated by $(X'V^{-1}(g)X)^{-1}$.
There is also an iterated estimator of $\beta$ that is analogous to
(14), but the iterated version of $\hat\beta$ rarely differs
appreciably from (29) if the model is correctly specified.

The large sample approximation to the distribution of $\hat\beta$
can be used to provide approximate confidence intervals for the
components of $\beta$. That is, if
$(X'V^{-1}(g)X)^{-1} = (\hat\sigma_{st})$ and
$\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)'$, then a
100(1 - $\alpha$) per cent confidence interval for $\beta_s$ is given by

$$ \hat\beta_s - z_{\alpha/2}\sqrt{\hat\sigma_{ss}} \le \beta_s \le \hat\beta_s + z_{\alpha/2}\sqrt{\hat\sigma_{ss}}, $$

where $z_{\alpha/2}$ is the 100(1 - $\alpha$) per cent critical value of
the standard normal distribution. The usual theory for the normal
distribution can be used in conjunction with the Bonferroni
inequality if simultaneous confidence intervals are desired.



Sometimes it is useful to test the hypothesis that $\beta = 0$,
that is, that all the components of $\beta$ are simultaneously zero.
The following statistics provide the basis for conducting these
tests. The hypothesis that $\beta = 0$ can be tested using the
statistic

$$ H_1 = \hat\beta' X' V^{-1}(g)\, g. \tag{31} $$

If $\beta = 0$, $H_1$ has an asymptotic chi-square distribution given by

$$ H_1 \sim \chi^2_p. $$

The test that $\beta = 0$ at the significance level $\alpha$ therefore
consists of comparing the obtained value of $H_1$ to the
100(1 - $\alpha$) per cent critical value of the chi-square
distribution with p degrees of freedom. If the value of $H_1$ exceeds
the critical value, the hypothesis that $\beta = 0$ is rejected. Note
that the statistic $H_1$ is analogous to the weighted sum of squares
due to the regression in weighted least squares. Therefore the test
that $\beta = 0$ corresponds to a test that the weighted sum of
squares due to the regression is greater than would be expected
if $\beta = 0$.




Testing Model Specification 

We will argue subsequently that tests of model
specification are an important step in the analysis of effect
size data. The test of model specification is a way of
determining whether the observed effect size estimates are
reasonably consistent with the model used in the data analysis.
If the number k of effect size estimates exceeds the number of
predictors p, then a natural test of model specification is
available. If k > p, the specification of the regression model
can be tested using the statistic

$$ H_2 = g'V^{-1}(g)\,g - H_1. \tag{32} $$

When $\delta = X\beta$, i.e., the model is correctly specified, $H_2$
has an asymptotic chi-square distribution given by

$$ H_2 \sim \chi^2_{k-p}. $$

The test for model specification at a significance level $\alpha$
therefore consists of comparing the obtained value of $H_2$ with the
100(1 - $\alpha$) per cent critical value of a chi-square distribution
with (k - p) degrees of freedom. If the obtained value of $H_2$
exceeds the critical value, then model specification is
rejected. Note that $H_2$ is analogous to the weighted residual sum
of squares in weighted least squares. Thus the test for model
specification is a test for greater residual variation than
would be expected if $\delta = X\beta$. Such a test is not possible in
the context of the usual normal theory because the means and
variances of observations are independent. In the case of
effect sizes, however, the sampling variance of $g_i$ given in (8)
is completely determined by the mean of $g_i$ and the sample size.
Therefore the "expected" residual variation is determined as a
function of $X$ and $g$.

The test for model specification will often be used to
demonstrate that the data (sample effect sizes) are reasonably
consistent with the model used in data analysis. It is
therefore important to have some understanding of the factors
affecting the power of the test for model specification. Two
factors that influence the power of the test are the number k of
studies and the sample sizes ($n_i^E$ and $n_i^C$) of those studies.
The latter factor (the sample sizes of the studies) is often the
most significant. The reason is that the specification test
statistic $H_2$ can be loosely described as a sum of squares of
standardized residuals. The residual for the i-th study is
"standardized" by the square root of the sampling variance of
the i-th effect size estimate. When $n_i^E = n_i^C = n_i$ this
sampling variance is approximately $2/n_i$. Therefore if the sample
size $n_i$ in each group is large, even a small deviation from the
model may result in a large contribution to the test statistic.
Similarly, if the within-group sample sizes $n_i$ are small, even
reasonably large deviations from the model may not yield a large
"standardized residual" contribution to the test statistic.
These arguments can be formalized into a rigorous development of
power functions under so-called local alternative hypotheses,
but the formal arguments will not be given in this paper.
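A rough numerical illustration may help; the deviation of 0.2 and the group sizes of 200 and 10 are invented for the example. Taking the sampling variance to be about $2/n_i$, the same deviation from the model gives very different standardized residuals, and hence very different contributions (the squared values) to the test statistic:

```latex
\frac{0.2}{\sqrt{2/200}} = \frac{0.2}{0.1} = 2.0
\quad (\text{contribution} \approx 4.0),
\qquad
\frac{0.2}{\sqrt{2/10}} \approx \frac{0.2}{0.447} \approx 0.45
\quad (\text{contribution} \approx 0.2).
```

A handful of large studies that each miss the model by this amount could therefore lead to rejection, while the same misfit in small studies would hardly register.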



It is not necessarily true that large numbers of studies
(large values of k) lead to rejection of model specification.
The author has seen relatively simple models that fit well with
over 100 studies, and many examples of well specified models for
40-80 effect sizes. If a particular model does not fit well,
then diagnostic procedures are called for. Examination of
residuals is often helpful. Such examinations may reveal
patterns that suggest variables that should be added to the
model. Alternatively, some studies may consistently yield
effect size estimates that deviate greatly from the prediction
of the model and therefore merit closer examination.



Computing Estimates and Test Statistics

The estimates and test statistics presented in this section
can be easily calculated using any computer program package that
manipulates matrices (such as SAS Proc Matrix). A simpler
alternative to the computation of estimates and test statistics
is the use of a computer program (such as SAS Proc GLM) that can
perform weighted least squares analyses.

Weighted least squares involves estimation of linear model
parameters by minimizing a weighted sum of squares of
differences between observations and estimates. Given a design
matrix X, a vector of observations Y, and a diagonal weight
matrix W, the weighted least squares estimate of $\beta$ in the model
$Y = X\beta$ is

$$\hat{\beta}_w = (X'WX)^{-1}X'WY.$$



Note that the form of this estimator is the same as that of
$\hat{\beta}$. Thus $\hat{\beta}$ is a special case of $\hat{\beta}_w$ where the weight matrix W is
the inverse of the estimated covariance matrix of the effect
size estimates. That is, the weight for the $i$th case is given by

$$w_i = \frac{2(n_i^E + n_i^C)\,n_i^E n_i^C}{2(n_i^E + n_i^C)^2 + n_i^E n_i^C g_i^2},$$

and the weight matrix is $W = \mathrm{diag}(w_1, \ldots, w_k)$.

The estimator $\hat{\beta}$ is the weighted least squares estimator of
$\beta$ using design matrix X, data vector $g$, and the weights $w_i$
given above. The large sample covariance matrix of $\hat{\beta}$ was given
previously. With the weights above it equals $(X'WX)^{-1}$, the
inverse of the weighted sum of squares and cross products
matrix in the weighted least squares analysis. If the computer
program fits a "no-intercept" model, the test statistic for
testing that $\beta = 0$ is given by the weighted sum of squares due
to the regression in the weighted least squares. A similar
statistic for testing that all components of $\beta$ except the
intercept are simultaneously zero is given if the weighted
least squares program fits an intercept. In the latter case the
test statistic $H_2$ will be compared to the critical value of a
chi-square distribution on k - p - 1 degrees of freedom. The test
statistic $H_2$ for testing model specification will always be the
value of the weighted sum of squares about the regression line
(the error sum of squares). Thus all of the statistics described
in this section may be obtained from a single run of a standard
packaged computer program.
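
The computation described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not
the monograph's own program: the function names
(`effect_size_weight`, `solve`, `weighted_least_squares`) and the
four-study data set are hypothetical, and only the standard library
is used. The sketch forms $X'WX$ and $X'WY$, solves for the
estimate, and returns the weighted regression sum of squares (in
its "no-intercept" form, for testing that all coefficients are
zero) and the weighted error sum of squares (for testing model
specification).

```python
def effect_size_weight(n_e, n_c, g):
    """Weight for one study: reciprocal of the large-sample variance of g."""
    total = n_e + n_c
    return 2.0 * total * n_e * n_c / (2.0 * total ** 2 + n_e * n_c * g ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(X, y, w):
    """Return (beta, cov, ss_regression, ss_error) for diagonal weights w."""
    k, p = len(X), len(X[0])
    xtwx = [[sum(w[i] * X[i][a] * X[i][b] for i in range(k)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(w[i] * X[i][a] * y[i] for i in range(k)) for a in range(p)]
    beta = solve(xtwx, xtwy)
    # Columns of (X'WX)^{-1}, obtained by solving against unit vectors.
    cols = [solve(xtwx, [1.0 if r == j else 0.0 for r in range(p)])
            for j in range(p)]
    cov = [[cols[j][i] for j in range(p)] for i in range(p)]
    # Weighted sum of squares due to regression: beta' (X'WX) beta.
    ss_regression = sum(beta[a] * xtwx[a][b] * beta[b]
                        for a in range(p) for b in range(p))
    # Weighted sum of squares about the regression (error sum of squares).
    fitted = [sum(X[i][a] * beta[a] for a in range(p)) for i in range(k)]
    ss_error = sum(w[i] * (y[i] - fitted[i]) ** 2 for i in range(k))
    return beta, cov, ss_regression, ss_error


# Hypothetical data: four studies, an intercept and one study characteristic.
g = [0.10, 0.30, 0.35, 0.60]
n_e = [30, 45, 60, 25]
n_c = [30, 40, 60, 25]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
w = [effect_size_weight(a, b, gi) for a, b, gi in zip(n_e, n_c, g)]
beta, cov, ss_regression, ss_error = weighted_least_squares(X, g, w)
print("estimates:", beta)
print("regression SS:", ss_regression, "error SS:", ss_error)
```

The error sum of squares would then be compared with a chi-square
critical value taken from tables, with degrees of freedom equal to
the number of studies minus the number of fitted coefficients.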



The Effects of Measurement Error on Effect Size 

The standardized mean difference, defined as in (1), is a
measure of the magnitude of the treatment effect compared to the
variability within the two groups of the experiment. The
implicit assumption is that the variability within the
experimental and control groups arises from stable differences
between subjects (or more generally between experimental units).
If the response measure is not perfectly reliable, i.e., if
errors of measurement are present, then measurement error also
contributes to the within-group variability. Measurement error,
therefore, alters the population value of the standardized mean
difference. If the object is to estimate the value, $\delta$, of the
standardized mean difference when no errors of measurement are
present, some procedure to correct for measurement error is
necessary.

Consider the population value of the standardized mean
difference in two cases, one in which the measurements are
error-free and one in which errors of measurement are present.
For simplicity of notation, the subscript $i$ denoting the
particular experiment is omitted in the exposition that follows,
but the results apply to each experiment when properly indexed.

If there are no errors of measurement, then denote the
within-cell standard deviation by $\sigma_\xi$. Let $\delta$ denote the population
value of the standardized mean difference when there are no
errors of measurement. Then

$$\delta = \frac{\mu^E - \mu^C}{\sigma_\xi},$$

where $\mu^E$ and $\mu^C$ are the population means of the experimental and
control groups respectively. Note that the use of the symbol $\delta$
is consistent with the definition of $\delta$ used in the structural
model (1).

In the second case, when errors of measurement are present,
the population means $\mu^E$ and $\mu^C$ are unchanged but the within-group
variance is larger. If $\delta'$ denotes the value of the standardized
mean difference when errors of measurement are present, then

$$\delta' = \frac{\mu^E - \mu^C}{\sqrt{\sigma_\xi^2 + \sigma_\eta^2}},$$

where $\sigma_\eta^2$ is the variance due to errors of measurement. The
relationship between $\delta$ and $\delta'$ can be expressed as

$$\delta' = \delta\sqrt{\rho},$$

where $\rho = \sigma_\xi^2/(\sigma_\xi^2 + \sigma_\eta^2)$ is the reliability of the response measure.

Thus the population value of the standardized mean
difference depends explicitly on the reliability of the response
measure. If the object is to estimate the value of $\delta$, the
standardized mean difference with no errors of measurement, then
estimation of $\delta'$ instead of $\delta$ can result in biased estimates.
Since reliabilities cannot exceed one, the effect of measurement
error is to reduce the magnitude of the parameter $\delta'$ compared
with $\delta$. In particular, errors of measurement cause the estimator
$g$ to estimate $\delta'$ instead of $\delta$, so that $E(g) = \delta' = \delta\sqrt{\rho}$. Hence
errors of measurement result in underestimates of the parameter
$\delta$.

If the reliability $\rho$ is known, the bias can be removed by
dividing $g$ by $\sqrt{\rho}$. When we combine several estimates that use
response scales with different reliabilities, each estimate can
be corrected for measurement error separately. Statistical
analyses can then be carried out using $g/\sqrt{\rho}$ in place of $g$ and
the variance of $g$ divided by $\rho$ in place of the variance of $g$.
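
A minimal sketch of this correction, assuming the reliability of
each response measure is known. The function name
`correct_for_unreliability` and the example values are
hypothetical. Dividing the estimate by the square root of the
reliability divides its sampling variance by the reliability
itself.

```python
import math


def correct_for_unreliability(g, var_g, reliability):
    """Return the estimate and variance corrected for measurement error."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    corrected_g = g / math.sqrt(reliability)
    corrected_var = var_g / reliability
    return corrected_g, corrected_var


# Hypothetical study: g = 0.40 with variance 0.04 on a scale of reliability 0.64.
print(correct_for_unreliability(0.40, 0.04, 0.64))
```

Each study is corrected with its own reliability before the
corrected estimates and variances enter the combined analysis.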



The Effects of Departures from Linear Equatability

In the statistical work described earlier we assumed that
the tests used to measure outcomes in the different studies are
linearly equatable. In practice this assumption may be only
approximately true. Some tests may have unique factors in
addition to the common factor shared among all tests. For
example, some experiments may use an expensive standardized test
to measure reading achievement, whereas other studies use
locally developed tests that are correlated with the
standardized test. If the locally developed tests have unique
factors, they will not be perfectly valid measures of reading
achievement as measured by the standardized test. This section
reports some results of Hedges (1981) on the effect of
invalidity of response measures on estimators of effect size.

One model for test invalidity assumes that a collection of
tests share a common factor, but that some tests also have
unique factors. If the population of test scores on a
particular test is generated by a model which includes both the
common factor (among all the tests) and a unique factor, then
the test is partially invalid. To examine the effects of partial
invalidity on effect size, Hedges (1981) derived the
standardized mean difference $\delta''$ when tests had unique factors.

First consider the case where the treatment only affects
the dependent variable via the common factor. Omitting the
subscripts we see that the standardized mean difference is

$$\delta'' = \frac{\delta\,\sigma_\xi}{\sqrt{\sigma_\xi^2 + \sigma_\theta^2 + \sigma_\eta^2}},$$

where $\sigma_\xi^2$, $\sigma_\theta^2$, and $\sigma_\eta^2$ are the variances accounted for by the
common factor, the unique factor, and measurement error,
respectively. The validity coefficient $\rho_{Y\xi}$ of the test can be
expressed as the within-group correlation of the test with the
common factor, that is,

$$\rho_{Y\xi} = \frac{\sigma_\xi}{\sqrt{\sigma_\xi^2 + \sigma_\theta^2 + \sigma_\eta^2}}.$$

Therefore the population value of the standardized mean
difference $\delta''$ can be expressed as

$$\delta'' = \delta\,\rho_{Y\xi}.$$
It seems unlikely that the population correlation of an invalid
test with the common factor among a series of tests would be
known. If X is a test that is not perfectly reliable, shares the
common factor, but has no unique factor, then the correlation
of the invalid test Y with the common factor can be obtained
from $\rho_{XY}$ by the familiar disattenuation formula (see, e.g.,
Lord and Novick, 1968):

$$\rho_{Y\xi} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}}},$$

where $\rho_{XX'}$ is the reliability of the test X. Thus the population
standardized mean difference $\delta''$ can be written in terms of a
correlation with a valid but unreliable test X and the
reliability $\rho_{XX'}$ of X, namely,

$$\delta'' = \frac{\delta\,\rho_{XY}}{\sqrt{\rho_{XX'}}}.$$

Since $\rho_{XY} \le \sqrt{\rho_{XX'}}$, it follows that $\delta'' \le \delta$. This means that
invalidity always reduces the standardized mean difference when
treatment affects only the common factor among the response
measures. In this case, estimates of effect size may be
corrected by substituting $g\sqrt{\rho_{XX'}}/\rho_{XY}$ for $g$.

When the treatment affects the test Y through both common
and unique factors, invalidity of the test may either increase
or decrease the standardized mean difference. Hence no simple
characterization of the effect of invalidity on estimates of $\delta$
obtained from the estimators is possible. In this case the
standardized mean difference is

$$\delta''' = \frac{\delta\,\sigma_\xi + \zeta}{\sqrt{\sigma_\xi^2 + \sigma_\theta^2 + \sigma_\eta^2}},$$

where $\zeta$ is the effect of the treatment on the unique factor, and
$\sigma_\xi^2$, $\sigma_\theta^2$, and $\sigma_\eta^2$ are the variances due to the common factor,
unique factor, and measurement error respectively. If $\zeta$, the
treatment effect via the unique factor, is large enough, namely

$$\zeta > \left(\sqrt{\sigma_\xi^2 + \sigma_\theta^2 + \sigma_\eta^2} - \sigma_\xi\right)\delta = \zeta_c,$$

then $\delta''' > \delta$. If $\zeta < \zeta_c$, then $\delta''' < \delta$.
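
The threshold can be checked numerically. The sketch below is an
illustration only: the function names and the variance components
are hypothetical, and it assumes the model described above, in
which the observed standardized mean difference is the mean
difference carried by the common factor plus the treatment effect
on the unique factor, divided by the total within-group standard
deviation.

```python
import math


def observed_effect(delta, sd_common, var_unique, var_error, unique_effect=0.0):
    """Standardized mean difference on a partially invalid, unreliable test."""
    total_sd = math.sqrt(sd_common ** 2 + var_unique + var_error)
    return (delta * sd_common + unique_effect) / total_sd


def critical_unique_effect(delta, sd_common, var_unique, var_error):
    """Unique-factor treatment effect at which the observed value equals delta."""
    total_sd = math.sqrt(sd_common ** 2 + var_unique + var_error)
    return (total_sd - sd_common) * delta


# Hypothetical components: common-factor SD 3, unique variance 4, error variance 3.
delta = 0.5
threshold = critical_unique_effect(delta, 3.0, 4.0, 3.0)
for unique_effect in (0.0, threshold, 2.0 * threshold):
    print(unique_effect, observed_effect(delta, 3.0, 4.0, 3.0, unique_effect))
```

With no treatment effect on the unique factor the observed value
falls below the error-free value, at the threshold the two
coincide, and beyond it invalidity inflates the observed value.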



Statistical Analysis When Correlations or 
Proportions are the Index of Effect Magnitude 
In some cases, the effect size will not be a suitable index
of effect magnitude for the studies that the reviewer wishes to
integrate. For example, the studies may investigate the
relationship between two continuous variables or they may study
the proportion of subjects reaching a criterion in different
groups. In the first example, Glass (1978) suggested the use of
the correlation coefficient as an index of effect magnitude. In
the second example, the difference between the proportions of
subjects reaching the criterion appears to be a natural index of
effect magnitude. Statistical procedures for combining
correlation coefficients and differences between proportions
were developed by Hedges and Olkin (1983). These procedures are
analogous to the methods already presented for the analysis of
effect sizes. One difference is that variance stabilizing
transformations for correlations and proportions simplify the
statistical methods.


Statistical Analysis for Correlations as Effect Magnitudes 
Suppose that k independent studies with sample sizes
$n_1, \ldots, n_k$ yield k independent sample correlation coefficients
$r_1, \ldots, r_k$. If $\rho_1, \ldots, \rho_k$ are the population correlations, the
first problem is to decide if the sample correlations could
reasonably have been drawn from populations with the same
underlying population correlation. A test for the homogeneity of
correlations is needed to determine if all the studies share a
common population correlation. Statistical analyses are
simplified if the correlations are transformed by Fisher's
z-transformation. Let

$$z_i = \frac{1}{2}\log\frac{1 + r_i}{1 - r_i}, \qquad \zeta_i = \frac{1}{2}\log\frac{1 + \rho_i}{1 - \rho_i}, \qquad i = 1, \ldots, k.$$


A test for homogeneity of $\rho_1, \ldots, \rho_k$ uses the test statistic

$$H = \sum_{i=1}^{k} (n_i - 3)(z_i - z_\cdot)^2, \qquad (33)$$

where $z_\cdot$ is the weighted average of the $z_i$ given by

$$z_\cdot = \frac{\sum_{i=1}^{k} (n_i - 3)\,z_i}{\sum_{i=1}^{k} (n_i - 3)}. \qquad (34)$$

If $\rho_1 = \cdots = \rho_k$ and all the $n_i$ are moderately large, then
H given in (33) is distributed approximately as a chi-square on
(k - 1) degrees of freedom. Thus the test for homogeneity of the
correlations consists of computing H and comparing the obtained
value to the $100(1 - \alpha)$ per cent critical value of the
chi-square distribution on (k - 1) degrees of freedom. If the
obtained value of H exceeds the critical value we reject the
hypothesis of homogeneity at the $100\alpha$ per cent level. A
computational formula for H is

$$H = \sum_{i=1}^{k} (n_i - 3)\,z_i^2 - \frac{\left[\sum_{i=1}^{k} (n_i - 3)\,z_i\right]^2}{\sum_{i=1}^{k} (n_i - 3)}.$$




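If the correlations are homogeneous, then the natural
estimate of the z-transform of the common correlation is $z_\cdot$
given in (34). This estimate may be converted into an estimate
of $\rho$ by finding the value of $\rho$ that yields $z_\cdot$ as its z-transform,
i.e.,

$$\hat{\rho} = \tanh(z_\cdot) = \frac{e^{z_\cdot} - e^{-z_\cdot}}{e^{z_\cdot} + e^{-z_\cdot}}. \qquad (35)$$

In large samples, $z_\cdot$ has a normal distribution, given by

$$z_\cdot \sim N(\zeta, \sigma_{z\cdot}^2),$$

where $\zeta$ is the z-transform of the common correlation $\rho$ and

$$\sigma_{z\cdot}^2 = \frac{1}{\sum_{i=1}^{k} (n_i - 3)}. \qquad (36)$$

This normal distribution can be used to obtain a large sample
confidence interval for $\rho$. A $100(1 - \alpha)$ per cent confidence
interval for $\zeta$ is given by

$$z_\cdot - z_{\alpha/2}\,\sigma_{z\cdot} \le \zeta \le z_\cdot + z_{\alpha/2}\,\sigma_{z\cdot},$$

where $\sigma_{z\cdot}$ is given in (36) and $z_{\alpha/2}$ is the $100(1 - \alpha)$ per cent
critical value of the standard normal table. The $100(1 - \alpha)$ per
cent confidence interval for $\rho$ is given by

$$\tanh(z_\cdot - z_{\alpha/2}\,\sigma_{z\cdot}) \le \rho \le \tanh(z_\cdot + z_{\alpha/2}\,\sigma_{z\cdot}),$$

where tanh(x) is given by (35).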
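
The steps above (transform, pool, test homogeneity, back-transform)
can be sketched as follows. The function names `fisher_z` and
`pool_correlations` and the three example studies are hypothetical,
and the critical value is taken as the two-tailed standard normal
quantile, which is an assumption about how the tabled value is
read. The chi-square critical value is not computed here, since it
would be looked up in tables for the returned degrees of freedom.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def pool_correlations(rs, ns, alpha=0.05):
    """Homogeneity statistic, pooled correlation, and confidence interval."""
    weights = [n - 3 for n in ns]
    zs = [fisher_z(r) for r in rs]
    total = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total
    # Homogeneity statistic: weighted squared deviations from the average.
    h = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    se = math.sqrt(1.0 / total)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    interval = (math.tanh(z_bar - crit * se), math.tanh(z_bar + crit * se))
    return {"H": h, "df": len(rs) - 1,
            "rho_hat": math.tanh(z_bar), "ci": interval}


# Hypothetical correlations and sample sizes from three studies.
result = pool_correlations([0.25, 0.32, 0.18], [60, 85, 40])
print(result)
```

If the statistic exceeds the tabled chi-square value, the pooled
estimate should not be read as a single common correlation.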

If a priori knowledge or the formal test of homogeneity
suggests that the correlations are not homogeneous, then the
investigator may wish to determine whether various
characteristics of the studies are related to the correlation.
Suppose that $\zeta_i$, the z-transform of the correlation in the $i$th
study, depends on a vector of p study characteristics $(x_{i1}, \ldots, x_{ip})'$,
where p < k. The vectors $(x_{i1}, \ldots, x_{ip})'$, $i = 1, \ldots, k$, are denoted
$\mathbf{x}_i$ and define the design matrix X as

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{kp} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_k' \end{pmatrix},$$

where X is assumed to have rank p. Define the vectors
$\zeta = (\zeta_1, \ldots, \zeta_k)'$, $z = (z_1, \ldots, z_k)'$, and $\beta = (\beta_1, \ldots, \beta_p)'$.
Then the model becomes $\zeta_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p$, $i = 1, \ldots, k$, or
alternatively

$$\zeta = X\beta.$$



Hedges and Olkin (1983) showed that a natural estimator of $\beta$ is

$$\hat{\beta} = (X'VX)^{-1}X'Vz, \qquad (37)$$

where $V = \mathrm{diag}(n_1 - 3, \ldots, n_k - 3)$. When all the $n_i$ are reasonably
large, $\hat{\beta}$ has a p-variate normal distribution given by

$$\hat{\beta} \sim N_p\left(\beta, (X'VX)^{-1}\right). \qquad (38)$$



The large sample distribution of $\hat{\beta}$ can be used to obtain
confidence intervals for $\beta_1, \ldots, \beta_p$. For example, a
(nonsimultaneous) $100(1 - \alpha)$ per cent confidence interval for $\beta_j$
is given by

$$\hat{\beta}_j - z_{\alpha/2}\sqrt{v_{jj}} \le \beta_j \le \hat{\beta}_j + z_{\alpha/2}\sqrt{v_{jj}},$$

where $v_{jj}$ is the $j$th diagonal element of $(X'VX)^{-1}$ and $z_{\alpha/2}$ is
the $100(1 - \alpha)$ per cent critical value obtained from the standard
normal distribution.

A simultaneous test that $\beta = 0$ uses the test statistic

$$H_1 = \hat{\beta}'(X'VX)\hat{\beta}, \qquad (39)$$

which has a chi-square distribution when $\beta = 0$ given by

$$H_1 \sim \chi_p^2.$$

Thus the test that $\beta = 0$ at a significance level $\alpha$ consists of
comparing the obtained value of $H_1$ with the $100(1 - \alpha)$ per cent
critical value of chi-square with p degrees of freedom. If $H_1$
exceeds the critical value we reject the hypothesis that $\beta = 0$.

If k > p, a test of the specification (goodness of fit) of
the regression model uses the test statistic

$$H_2 = z'Vz - H_1. \qquad (40)$$

When the model is correctly specified, the statistic $H_2$ has a
chi-square distribution given by

$$H_2 \sim \chi_{k-p}^2.$$

Thus the test of model specification at a significance level $\alpha$
consists of comparing the obtained value of $H_2$ with the
$100(1 - \alpha)$ per cent critical value of the chi-square distribution
on (k - p) degrees of freedom. If $H_2$ exceeds the critical value
we reject the specification of the regression model. Rejection
of model specification suggests that the model does not account
for all of the variability of the correlations. In this case it
may be desirable to think about additional explanatory variables
or to examine residuals and look for unusual $z_i$ values to
determine why the model does not fit well. It should be noted
that the test for model specification can be very sensitive when
sample sizes are large, so even minor deviations from the model
can result in rejection of model specification.
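
The specification statistic can also be read as a weighted
residual sum of squares, in the same way as the error sum of
squares in the weighted least squares discussion earlier. The
short derivation below is a sketch, assuming the estimator is the
weighted least squares solution $(X'VX)^{-1}X'Vz$ and that the
specification statistic equals $z'Vz$ minus the regression
statistic $\hat{\beta}'(X'VX)\hat{\beta}$. It uses only the normal
equations $X'Vz = X'VX\hat{\beta}$.

```latex
\begin{aligned}
(z - X\hat{\beta})'V(z - X\hat{\beta})
  &= z'Vz - 2\hat{\beta}'X'Vz + \hat{\beta}'X'VX\hat{\beta} \\
  &= z'Vz - 2\hat{\beta}'X'VX\hat{\beta} + \hat{\beta}'X'VX\hat{\beta} \\
  &= z'Vz - \hat{\beta}'(X'VX)\hat{\beta}.
\end{aligned}
```

So a large specification statistic reflects large weighted
residuals, which is why examining the residuals is the natural
follow-up when the model is rejected.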

Statistical Analysis for Differences in Proportions 

Suppose that k independent studies each compare the
proportion of subjects achieving a criterion in an experimental
and a control group. Let $p_i^E$ and $p_i^C$ denote the sample proportions
of subjects reaching a criterion in the experimental and control
groups respectively of the $i$th study and let $\pi_i^E$ and $\pi_i^C$ represent
the corresponding proportions in the population. The obvious
index of the magnitude of the treatment effect in the $i$th study
is the difference between the experimental and control group
proportions, $\pi_i^E - \pi_i^C$. Statistical analyses are simplified,
however, if a slightly different index of effect magnitude is
used. Define the population and sample indices of effect
magnitude as

$$\omega_i = \sin^{-1}\left(\sqrt{\pi_i^E}\right) - \sin^{-1}\left(\sqrt{\pi_i^C}\right), \qquad (41)$$

and

$$w_i = \sin^{-1}\left(\sqrt{p_i^E}\right) - \sin^{-1}\left(\sqrt{p_i^C}\right), \qquad i = 1, \ldots, k. \qquad (42)$$



x * 1 



A test of the hypothesis that $\omega_1 = \omega_2 = \cdots = \omega_k$ uses the test
statistic

$$H = \sum_{i=1}^{k} 2\tilde{n}_i (w_i - w_\cdot)^2, \qquad (43)$$

where $\tilde{n}_i = n_i^E n_i^C/(n_i^E + n_i^C)$, $n_i^E$ and $n_i^C$ are the experimental and control
group sample sizes in the $i$th study, and $w_\cdot$ is the weighted
average of the $w_i$ given by

$$w_\cdot = \frac{\sum_{i=1}^{k} \tilde{n}_i w_i}{\sum_{i=1}^{k} \tilde{n}_i}. \qquad (44)$$

When $\omega_1 = \omega_2 = \cdots = \omega_k$ and the sample sizes are all reasonably
large, H given in (43) has a chi-square distribution on (k - 1)
degrees of freedom. Thus the test of homogeneity of the $\omega$'s at
significance level $\alpha$ consists of comparing the obtained value of
H with the $100(1 - \alpha)$ per cent critical value of the chi-square
distribution with (k - 1) degrees of freedom. Values of H that
are larger than the critical value result in rejection of the
hypothesis that all studies share a common effect magnitude.

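If the $w_i$ values could reasonably have come from
populations with the same value of $\omega$, then the natural estimate
of $\omega$ is $w_\cdot$ given in (44). In large samples the distribution of
$w_\cdot$ is given by

$$w_\cdot \sim N(\omega, \sigma_{w\cdot}^2),$$

where

$$\sigma_{w\cdot}^2 = \frac{1}{2\sum_{i=1}^{k} \tilde{n}_i}. \qquad (45)$$

This large sample distribution can be used to obtain an
approximate confidence interval for $\omega$. A $100(1 - \alpha)$ per cent
confidence interval for $\omega$ is

$$w_\cdot - z_{\alpha/2}\,\sigma_{w\cdot} \le \omega \le w_\cdot + z_{\alpha/2}\,\sigma_{w\cdot},$$

where $\sigma_{w\cdot}$ is given in (45) and $z_{\alpha/2}$ is obtained from tables of the
standard normal distribution.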

If a priori knowledge or a formal hypothesis test leads the
investigator to believe that the $\omega$'s are not homogeneous, then
the investigator may wish to determine whether various
characteristics of studies are related to the index $\omega$ of effect
magnitude. Suppose that $\omega_i$ depends on a vector of p study
characteristics $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$, where p < k. The vectors
$\mathbf{x}_i$ define the design matrix

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{kp} \end{pmatrix},$$

where X is assumed to have rank p. Define the vectors
$\omega = (\omega_1, \ldots, \omega_k)'$ and $w = (w_1, \ldots, w_k)'$ of parameters and estimates
and let $\beta = (\beta_1, \ldots, \beta_p)'$ be the vector of regression
coefficients. The linear model is $\omega_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p$,
$i = 1, \ldots, k$, or alternatively $\omega = X\beta$.

Hedges and Olkin (1983) showed that the natural estimator
of $\beta$ is

$$\hat{\beta} = (X'VX)^{-1}X'Vw, \qquad (46)$$

where $V = 2\,\mathrm{diag}(\tilde{n}_1, \ldots, \tilde{n}_k)$. When all of the $\tilde{n}_i$ are reasonably
large, $\hat{\beta}$ has a p-variate normal distribution given by

$$\hat{\beta} \sim N_p\left(\beta, (X'VX)^{-1}\right).$$



This large sample distribution of $\hat{\beta}$ can be used to obtain
confidence intervals for $\beta$, just as in the case of the analysis
of correlations. The test that $\beta = 0$ and the test of model
specification based on the statistics

$$H_1 = \hat{\beta}'(X'VX)\hat{\beta}$$

and

$$H_2 = w'Vw - H_1$$

are identical to the analogous tests in the analysis based on
correlations.



CONCLUSIONS 



There has been some vehement criticism of Glass's
meta-analysis. Some critics have argued that meta-analysis may
lead to oversimplified conclusions about the effect of a
treatment because it condenses the results of a series of
studies into a few parameter estimates. For example, Presby
(1978) argued that even when studies are grouped according to
variations in the treatment, reviewers might reasonably disagree
on the appropriate groupings. Grouping studies into overly
broad categories and calculating a mean effect size for each
category might serve to wash out real variations among
treatments in the categories. Thus it would appear that
variations in treatment were unrelated because the mean effect
sizes for the categories did not differ. An obvious extension
of this argument is that reviewers might reasonably disagree on
explanatory variables that could be related to effect sizes.
Hence failure to find variables that are systematically related
to effect size does not imply that the effect sizes are
consistent across studies. It may only imply that the reviewer
has examined the wrong explanatory variables.

A related criticism is that the studies in a collection may
give fundamentally different answers (e.g., have different
population effect sizes) perhaps because of the artifacts of a
multitude of design flaws (see, e.g., Eysenck, 1978). Any
analysis of the effect sizes is therefore an analysis of
estimates influenced by a variety of factors other than the true
magnitude of the effect of the treatment. Thus meta-analyses may
be another case of "garbage in, garbage out." The argument
underlying this criticism is that flaws in studies may influence
effect sizes.

The statistical methods presented in this monograph provide
a mechanism for responding to the criticisms mentioned previously.
In the simplest case the reviewer summarizes the results of a
series of studies by the average effect size estimate. Is this
an oversimplification of the results of the studies? The test
of homogeneity of effect size provides a method of empirically
testing whether the variation in effect size estimates is
greater than would be expected by chance alone. If the
hypothesis of homogeneity is not rejected, the reviewer is in a
strong position vis-a-vis the argument that studies exhibit real
variability which is obscured by coarse grouping. If the model
of a single population effect size fits the data adequately,
then a desire for parsimony suggests this model should be
considered seriously.

Failure to reject the homogeneity of effect sizes from a
series of studies does not necessarily disarm the criticism that
the results of the studies are artifacts of design flaws. For
example, if a series of studies all share the same flaw,
consistent results across the series of studies may be an
artifact of just that flaw. That is, the design flaw in all of
the studies may act to make the effect sizes in the studies
consistent with one another and consistently wrong as an
estimate of the treatment effect. On the other hand, the
studies may not all have the same flaws. If a variety of
different studies with different design flaws all yield
consistent results, it may be implausible to explain the
consistency of the results of a series of studies as a
conspiracy of different artifacts all yielding the same bias.
Thus the reviewer who finds consistency in research results and
who knows the limitations of the individual studies may be in a
strong position against the "garbage in, garbage out" argument.
I emphasize that careful examination of the individual research
studies and some scrutiny of the attendant design problems is
essential. Without such analysis of the studies, a single
source of bias is a very real and plausible rival explanation
for empirical consistency of research results.

When a reviewer explains the effect sizes from a series of
studies via a model involving explanatory variables (e.g., the
effect size varies according to grade level), tests of model
specification play a role analogous to that of the test of
homogeneity. It is difficult to argue that additional variables
are needed to explain the variation in effect sizes if the
specification test suggests that additional variables are not
needed.

Evidence that the model is correctly specified does not
necessarily mean that the artifacts of design flaws may be
ignored. If all studies share a common design flaw then the
results of all of the studies may be biased to an unknown
extent. If design flaws are correlated with explanatory
variables, then the effects of those design flaws are confounded
with the effects of the explanatory variable. It may be
difficult or impossible to determine the real source of the
effect. However, if several design flaws occur in different
studies, if few of the flaws occur in more than a few studies,
and if simple models appear to be correctly specified, then it
seems implausible that inferences drawn about the effect sizes
are artifacts of biases due to design flaws.

Methods for the quantitative synthesis of research
(meta-analysis) have received a great deal of attention. These
methods are promising improvements in the methodology for
research reviewing. Some authors (Jackson, 1980) have argued
that research reviewing in social sciences has not been subject
to the kind of rigorous professional standards of methodology
that have been imposed on the conduct of primary research. One
positive aspect of the interest in meta-analysis has been the
movement toward the use of rigorous methods of sampling,
measurement, and statistics in research reviewing (Cooper,
1982). As Glass (1978) pointed out, research reviewing is a
difficult task that demands as much creativity and insight as
the conduct of primary research. The application of rigorous
methods in research reviews may be a step toward a time when
reviewing is accorded equal status with the generation of
primary research.

Statistical techniques for research reviews are not well
developed. The present survey of methodology is a beginning. New
methods are clearly needed for the exploratory analysis of
effect size data. Glass (1978) argued that it is often useful to
estimate effect sizes from other statistics or from data other
than means and standard deviations. Methods for obtaining such
estimates were discussed in Glass (1978) and in Glass, McGaw,
and Smith (1981). Yet very little is known about the properties
of such estimates. The development of rigorous statistical
theory for these estimators would be an important contribution.

Many research studies provide data on several measures of
the same or related constructs. The several effect size
estimates derived from different measures applied to the same
individuals are therefore correlated. In fact, the vector of
effect size estimates can be shown to have a multivariate normal
distribution in large samples. Moreover, the (large sample)
correlation matrix of the effect sizes is the same as that of
the original observations. Thus if a reading and a mathematics
achievement test have a (population) correlation of .7, effect
sizes derived from these two tests will also have a correlation
of .7 in large samples. This result can be used in the study of
covariation among effect sizes derived from the same sample, but
relatively little work has been done. A conservative solution is
to use the average of multiple effect sizes as the only estimate
of effect size in statistical analyses. More work is definitely
needed in the area of the multivariate analysis of effect sizes.
A related issue is how to handle estimation of effect size from
a series of correlated estimates. Glass (1978) suggested the use
of jackknife estimators in this case, which seems sensible. This
problem of estimation of effect size from correlated estimates
merits further investigation.

The possibility of bias due to the use of
statistical significance as a criterion in editorial decisions
has been suggested (Sterling, 1959). Some preliminary work (Lane
& Dunlap, 1978) suggests that the effects of publication bias
can be severe. We have little evidence about how seriously such
biases affect quantitative research syntheses. Some empirical
evidence on this question is discussed in Glass, McGaw, and
Smith (1981). There are very few statistical methods for dealing
with the effects of these biases. Future meta-analyses will, no
doubt, reveal new methodological problems that also need
attention.




References 

Brewer, J. K. On the power of statistical tests used in the
American Educational Research Journal. American
Educational Research Journal, 1972, 391-401.

Campbell, D. T. Definitional versus multiple operationalism.
Et al., 1969, 2, 14-17.

Chase, L. J., & Chase, R. B. A statistical power analysis of
applied psychological research. Journal of Applied
Psychology, 1976, 61, 234-237.

Cochran, W. G. Problems arising in the analysis of a series of
similar experiments. Journal of the Royal Statistical
Society Supplement, 1937, 4, 102-118.

Cochran, W. G. The combination of estimates from different
experiments. Biometrics, 1954, 10, 101-129.

Cohen, J. The statistical power of abnormal-social psycho-
logical research: A review. Journal of Abnormal and Social
Psychology, 1962, 65, 145-153.

Cohen, J. Statistical power analysis for the behavioral
sciences. New York: Academic Press, 1979.

Cooper, H. M. Scientific guidelines for conducting integrative
research reviews. Review of Educational Research, 1982,
52, 291-302.

Eysenck, H. J. An exercise in mega-silliness. American
Psychologist, 1978, 33, 517.

Gage, N. L. The scientific basis of the art of teaching. New
York: Teachers College Press, 1978.

Giaconia, R. M., & Hedges, L. V. Identifying features of
effective open education. Review of Educational Research,
1982, 52, (in press).

Glass, G. V Primary, secondary, and meta-analysis of research.
Educational Researcher, 1976, 5, 3-8.

Glass, G. V Integrating findings: The meta-analysis of re-
search. In L. S. Shulman (Ed.), Review of Research in
Education, 5. Itasca, Ill.: F. E. Peacock, 1978.

Glass, G. V, McGaw, B., & Smith, M. L. Meta-analysis in social
research. Beverly Hills, Ca.: Sage, 1981.

Glass, G. V & Smith, M. L. Meta-analysis of the relationship
between class-size and achievement. Educational Evaluation
and Policy Studies, 1979, 1, 2-16.

Hedges, L. V. Distribution theory for Glass's estimator of
effect size and related estimators. Journal of Educational
Statistics, 1981, 6, 107-128.



Hedges, L. V. Estimating effect size from a series of indepen-
dent experiments. Psychological Bulletin, 1982,
92, 490-499. (a)

Hedges, L. V. Fitting categorical models to effect sizes from
a series of experiments. Journal of Educational Statistics,
1982, 7, 119-137. (b)

Hedges, L. V. Fitting continuous models to effect size data.
Journal of Educational Statistics, 1982, 7, (in press). (c)

Hedges, L. V., & Olkin, I. Vote counting methods in research
synthesis. Psychological Bulletin, 1980, 88, 359-369.

Hedges, L. V., & Olkin, I. Regression models in research
synthesis. American Statistician, 1983, 37, (in press).

Jackson, G. B. Methods for integrative reviews. Review of
Educational Research, 1980, 50, 438-460.

Katzer, J., & Sodt, J. An analysis of the use of statistical
testing in communications research. Journal of
Communication, 1973, 23, 251-265.

Kraemer, H. C., & Andrews, G. A nonparametric technique for
meta-analysis effect size calculation. Psychological
Bulletin, 1982, 91, 404-412.

Kulik, J. A., Kulik, C. L., & Cohen, P. A. A meta-analysis of
outcome studies of Keller's personalized system of
instruction. American Psychologist, 1979, 34,
307-318.

Lane, D. M., & Dunlap, W. P. Estimating effect size: Bias
resulting from the significance criterion in editorial
decisions. British Journal of Mathematical and
Statistical Psychology, 1978, 31, 107-112.

Light, R. J., & Smith, P. V. Accumulating evidence: Procedures
for resolving contradictions among different research
studies. Harvard Educational Review, 1971, 41,
429-471.

Lord, F. M., & Novick, M. R. Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.

Pascarella, E. T., Walberg, H. J., Junker, L. K., & Haertel,
G. D. Continuing motivation in science for early and late
adolescents. American Educational Research Journal,
1981, 18, 439-452.

Pearson, K. On a method of determining whether a sample of
size n supposed to have been drawn from a parent population
having a known probability integral has probably been
drawn at random. Biometrika, 1933, 25, 379-410.




Pillemer, D. B., & Light, R. J. Synthesizing outcomes: How to
use research evidence from many studies. Harvard Education
Review, 1980, 50, 176-195.

Presby, S. Overly broad categories obscure important
differences. American Psychologist, 1978, 33, 514-515.

Rosenthal, R., & Rubin, D. B. Comparing effect sizes of
independent studies. Psychological Bulletin, 1982, 92,
500-504.

Smith, M. L., & Glass, G. V Meta-analysis of psychotherapy
outcome studies. American Psychologist, 1977, 32,
752-760.

Smith, M. L., & Glass, G. V Meta-analysis of class size and its
relationship to attitudes and instruction. American
Educational Research Journal, 1980, 17, 419-433.

Sterling, T. D. Publication decisions and their possible effects
on inferences drawn from tests of significance -- or vice
versa. Journal of the American Statistical Association,
1959, 54, 30-34.

Tippett, L. H. C. The methods of statistics. New York: Wiley,
1931.

Uguroglu, M. E., & Walberg, H. J. Motivation and achievement: A
quantitative synthesis. American Educational Research
Journal, 1979, 16, 375-389.

Yates, F., & Cochran, W. G. The analysis of groups of
experiments. Journal of Agricultural Science, 1938, 28,
556-580.




ERIC/TM Report 83






STATISTICAL METHODOLOGY IN META-ANALYSIS 

BY 

Larry V. Hedges
The University of Chicago



Many problems must be addressed by the reviewer who carries out a
meta-analysis. These problems include identifying and obtaining appropriate
studies, extracting estimates of effect size from the studies, coding or
classifying studies, analyzing the data, and reporting the results of the
data analysis. Earlier work by Glass, McGaw, and Smith describes methods
for dealing with these problems.

However, since their book, there has been a great deal of interest 
in the development of systematic statistical theory for meta-analysis. 
The purpose of this monograph is to provide a unified treatment of 
rigorous statistical methods for meta-analysis. These methods provide a
mechanism for responding to criticisms of meta-analysis, such as that 
meta-analysis may lead to oversimplified conclusions or be influenced 
by design flaws in the original research studies. 



ORDER FORM 



Please send ____ copies of ERIC/TM Report 83, "Statistical Methodology
in Meta-Analysis," at $7.00.

Name 



Address 



Zip 



Total enclosed $ 



Return this form to: 
ERIC/TM 

Educational Testing Service 
Princeton, NJ 08541 



RECENT TITLES 

IN THE ERIC/TM REPORT SERIES

#83 - Statistical Methodology in Meta-Analysis, by Larry V. Hedges. 12/82,
$7.00.

#82 - Microcomputers in Educational Research, by Craig W. Johnson. 12/82,
$8.50.

#81 - A Bibliography to Accompany the Joint Committee's Standards on
Educational Evaluation, compiled by Barbara M. Wildemuth. 107p.
$8.50.

#80 - The Evaluation of College Remedial Programs, by Jeffrey K. Smith
and others. 12/81, $8.50.

#79 - An Introduction to Rasch's Measurement Model, by Jan-Eric Gustafsson. 
12/81, $5.50. 

#78 - How Attitudes Are Measured: A Review of Investigations of Professional,
Peer, and Parent Attitudes toward the Handicapped, by Marcia D. Horne.
12/80, $5.50.

#77 - The Reviewing Processes in Social Science Publications: A Review of
Research, by Susan E. Hensley and Carnot E. Nelson. 12/80, $4.00.

#76 - Intelligence Testing, Education, and Chicanos: An Essay in Social
Inequality, by Adalberto Aguirre Jr. 12/80, $5.50.

#75 - Contract Grading, by Hugh Taylor. 12/80, $7.50.

#74 - Intelligence, Intelligence Testing and School Practices, by Richard
DeLisi. 12/80, $4.50.

#73 - Measuring Attitudes Toward Reading, by Ira Epstein. 12/80, $9.50.

#72 - Methods of Identifying Gifted Minority Students, by Ernest M. Bernal.
12/80, $4.50.

#71 - Sex Bias in Testing: An Annotated Bibliography, by Barbara Hunt. 12/79,
$5.00.

#70 - The Role of Measurement in the Process of Instruction, by Jeffrey K. 
Smith. 12/79, $3.50. 

#68 - The Educational Implications of Piaget's Theory and Assessment 
Techniques, by Richard DeLisi. 11/79, $5.00. 

#66 - Competency-Based Graduation Requirements: A Point of View, by Mary 
Ann Bunda. 1978, $2.00. 

#65 - The Practice of Evaluation, by Clare Rose and Glenn F. Nyre. 12/77,
$5.00.

#63 - Perspectives on Mastery Learning & Mastery Testing, by Jeffrey K.
Smith. 1977, $3.00.



